BEYOND DES: DATA COMPRESSION AND THE MPJ ENCRYPTION ALGORITHM by Michael Paul Johnson B. S., University of Colorado at Boulder, 1980 A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science Department of Electrical Engineering 1989 Copyright (C) 1989 Michael Paul Johnson. All Rights Reserved. This thesis for the Master of Science degree by Michael Paul Johnson has been approved for the Department of Electrical Engineering by Mark A. Wickert Charles E. Fosha, Jr. Rodger E. Ziemer Date Johnson, Michael Paul (M. S., Electrical Engineering) Beyond DES: Data Compression and the MPJ Encryption Algorithm Thesis directed by Assistant Professor Mark A. Wickert Many encryption algorithms have come and gone as cryptography, cryptanalysis, and technology have progressed. Today's communication and computer technologies need cryptography to truly secure data in many applications. The demands on the cryptography needed for some commercial applications will exceed the security offered by the National Bureau of Standards Data Encryption Standard (DES) in the near future due to advances in technology, advances in cryptanalysis, and the increasing rewards for breaking such a heavily used algorithm. To meet part of this need, a new block encryption algorithm is proposed. A Pascal program to implement this algorithm is given. One way to further increase security of encrypted data, as well as to achieve storage and/or transmission economy, is by redundancy reduction prior to encryption. A linguistic approach to redundancy reduction, together with an example computer program to implement it, is given for this purpose. LIST OF FIGURES viii I. INTRODUCTION 1 A. Motivation 2 B. Approach 4 II. HISTORY OF CRYPTOGRAPHY 6 III. ELEMENTS OF ENCRYPTION 8 A. Substitution 8 1. Monoalphabetic 8 2. Polyalphabetic 10 B. Permutation 11 C. Noise Addition 12 D. Feedback & Chaining 13 1. Plain Text Feedback 13 2. Cipher Text Feedback 14 E. Analog Encryption 15 IV. FACTORS RELATED TO ENCRYPTION 17 A. Change of Language 17 B. Digitization 18 C. Compression 18 D. Multiplexing 19 V. COMPARISON OF SELECTED ALGORITHMS 20 A. One-Time Key Tape 20 B. Linear Shift Register Feedback 21 C. Exponential Encryption 22 D. Knapsack 24 E. Rotor Machines 24 F. Codes 27 G. Galois Field and Hill Cryptosystems 28 H. DES 30 VI. DESIGN CONSIDERATIONS FOR MPJ 33 A. Strength Based on Key 33 B. Usability of Random Keys 33 C. Key Length & Block Size 34 D. Effort Required to Break 34 E. Computational Efficiency 35 F. Communication Channel Efficiency 36 G. No Back Doors or Spare Keys 36 VII. MPJ ENCRYPTION ALGORITHM 37 A. Description 37 1. Overall Structure of MPJ 37 2. Substitution Boxes 39 3. Wire Crossings 39 4. Key Generation 40 B. Implementation in Pascal 44 1. Exceptions from Standard Pascal 44 2. Main Program 45 3. Procedures Encrypt & Decrypt 46 4. Procedures Permute & Ipermute 47 5. Procedures Substitute & Isubst 47 C. Implementation in Hardware 47 D. Strengths & Weaknesses 48 VIII. DATA COMPRESSION 50 A. Purpose 50 B. Linguistic Parsing 53 C. Huffman Coding 55 D. Pascal Programs 55 IX. CONCLUSION 58 REFERENCES 59 APPENDIX A. MPJ in Pascal 64 B. Linguistic Data Compression Programs 72 FIGURE III.B Permutation. 11 III.C. Noise Addition 12 III.D. Block Cipher Modes. 14 III.E. Analog Time & Frequency Encryption 15 V.A. One-time key tape (AKA One-time pad) 21 V.A.1 Typex Rotor Machine 25 V.E.2. Wired Rotors 25 V.H.1. DES Enciphering 30 V.H.2. DES Nonlinear Function 30 V.H.3. DES Internal Key Generation 31 VII.A.1. Overall Structure of MPJ 38 VII.A.3. Wire Crossings 39 I. INTRODUCTION The increasing proliferation of digital communication and computer data base storage has brought with it increasing difficulty of maintaining the privacy of that data. There is only one effective way to protect the privacy of communications sent over such channels as satellites, terrestrial microwave, and cellular telephones. This is by encryption. It is clearly impossible to deny unauthorized access by a determined and knowledgeable interceptor to the communications, but it is possible to render the communications totally unintelligible to all but the intended receiver(s). There are many ways to reversibly transform data from its plain form to something that looks unintelligible, but many of these can be figured out (broken) by someone else. The study of how to hide secrets is cryptology. Trying to figure out the secrets that someone else has hidden is cryptanalysis. These two sciences are, of course, very much intertwined. History reveals many examples of cryptology that worked, and that didn't [KAH]. Successful cryptanalysis depends on taking advantage of as many of the following as are available to the cryptanalyst: (1) taking advantage of the redundancy in any natural language to determine the validity of assumptions, (2) clues gained from corresponding plain and cipher text, (3) information that might be known about the algorithm(s) used, (4) the general expected content of the cryptograms, (5) all of the cipher text that is available in the same system & key, (6) compromised keys, (7) as much computational and analytical power as can be obtained, and (8) mistakes made in the users of the cryptographic system. The cryptographer can make life as difficult as possible for the cryptanalyst by depriving him of some of these things by (1) using redundancy reduction before encryption, (2) using an algorithm that is resistant to the known plain text attack, (3) & (4) using a strong enough algorithm that these clues aren't really useful, (5) changing keys often and selecting them properly, (6) guarding keys as closely as the data they protect justifies, (7) ensuring that there aren't enough computers in the world to do a brute force attack on the algorithm, and (8) making sure users of the system understand how to properly use it. This thesis proposes one solution to the above challenge by proposing (1) an approach to linguistically based redundancy reduction, and (2) proposing a new data encryption algorithm (MPJ) that can be used where DES is in use now, but is more secure. A. Motivation There has been a great deal of discussion of the security of DES in the open literature. Most of it has been favorable to DES [MEY], but there are a few indications that it would be wise to supplement the aging DES and eventually replace it. DES has been in use since 1977, and has been used in a large number of applications where people have had many possible motivations for trying to break this cipher. During this time, it is possible that someone has discovered a computationally feasible method for doing so [KRU]. Under such circumstances, it is highly unlikely that such a discovery would be made known. There is no way to really know this for sure, unless you are one of the ones who has broken DES. The closest thing that I have found to an open admission of breaking DES is a story of the FBI successfully decrypting a file of drug transaction records that were encrypted on a PC using a DES board. [MCL] The DES board that the criminal used has an algorithm to generate a key from a word or phrase. By an exhaustive search of an English dictionary and key names from the criminal's family & friends using a supercomputer, the file was solved. This indicates some weakness in DES, but even more weakness in the way the key was chosen. The key was not chosen randomly, so the FBI's job was much easier. DES is subject to attacks that require precomputation that could tie up a supercomputer for a few years, after which it would take only a few days to solve a DES cryptogram. This is becoming less of a barrier as the price of computers drops and the speed and storage capacity of computers increase. The U. S. National Security Agency (NSA) is acting to release some of its own algorithms for ``sensitive but unclassified'' applications, such as communications between government contractors. The NSA also releases some classified algorithms to chip manufacturers for use in classified devices, but under strict controls [ULT]. This will take some of the load off of the DES algorithm, but there is a catch. They intend to keep the algorithms secret and control the keys. Since the security of the algorithm is dependent on the key in a way that only the NSA knows, this gives the NSA the exclusive ability to read everybody's' communications. That is not all bad, as it discourages the use of one of those systems by someone engaged in spying or other illegal activity. Unfortunately, these systems are not an option for protection of a corporations proprietary data that may have a great deal to do with profits and losses, but are not really in the domain of the NSA. There are other alternatives to DES now, but none of them in the public domain are even as good, let alone better, for general cryptography. For example, RSA encryption has great advantages in the authentication of digital signatures, but the complications of selecting good keys and the fact that the security of RSA relies heavily on the failure of the state of the art in mathematics to progress makes it at least inconvenient to use and at most insecure. Because of the above considerations, it is my intentions to suggest a better algorithm (MPJ) for general use in the private sector. Although the MPJ algorithm might be useful for government applications, too, the design requirements for the two areas of application vary [CHA] and the latter is kept shrouded in secrecy, so we shall leave it to the NSA. B. Approach First, the problem is defined. As explained in the above section, the basic problem is to create an algorithm to supplement DES. Design criteria are then conceived such that an algorithm that meets those criteria will solve the problem as defined. The design criteria chosen for the MPJ algorithm are discussed in chapter VI. To avoid repetition of one or more of the many mistakes that have been made throughout history with cryptography, it is, of course, necessary to research what was done in the past, both distant and recent. From this, many good ideas can be gleaned that apply to the problem at hand. These ideas, along with a knowledge of current technology are then applied to design criteria with a bit of creativity and a lot of hard work to define the new algorithm. In the process of doing all of the above, it became apparent that the security of encrypted data was vastly improved by first removing as much redundancy from the data as possible. Therefore, as sort of a bonus, a linguistic approach to data compression is also presented that can either be used in conjunction with the MPJ encryption algorithm, another encryption method, or just by itself for the savings that it gains in communications channel capacity and/or data storage space. II. HISTORY OF CRYPTOGRAPHY One of the best works on the history of cryptography is The Codebreakers, by David Kahn [KAH]. In that book, David Kahn discusses cryptography from the prehistory to World War II. The first codes and ciphers were in written form, and used to protect the privacy of communications sent by courier or mail through hostile or unknown territory. Some of these were reasonably good, but most were not difficult to break using manual methods, provided that the interceptor had sufficient cipher text and perhaps some probable text. The use of radio, especially by the military, increased the need for cryptography, as well as increasing the rewards for those who could break the encryption schemes in use. Kahn's documentation of the efforts of those who broke some very complex encryption schemes, like the German Enigma and the Japanese Purple ciphers, lend great insight to the kind of process cryptanalysis really is. Kahn points out the kinds of mistakes the inventors and users of cryptographic algorithms tend to make that reduce the security of their communications. For example, German users of Enigma tended to choose a three-letter indicator for their messages that consisted of three consecutive letters on the keyboard. This substantially reduced the number of keys that had to be searched to determine the one that they were using. While the designer of an algorithm may calculate the great number of combinations of keys that there are, the cryptanalyst looks at ways to isolate parts of the key so that the difficulty of a solution is much less than the size of the key space indicates. The difference in mind set between the concealer of secrets and the one who prys into them has caused many an inventor of an encryption algorithm to be overconfident. The job of the cryptanalyst is a tedious one. He tries all kinds of things to try to unscramble the cipher text in front of him. Sometimes the search is fruitless. Sometimes the search yields something that looks like a meaningful language. It is this ability to recognize a meaningful message when it comes out of the various operations that the cryptanalyst tries that makes the whole process possible. It is also helpful for the cryptanalyst to know some probable plain text that is contained in a message. This is almost always the case. For example, military messages even now have a very stereotyped format, with the from, to, and date indicators in the same places in the message. The cryptanalyst almost always knows what language to expect a message to be written in, and this is a great help. Natural languages contain a great deal of redundancy. A message that is only 90% recovered is usually readable. Natural languages also have very consistent statistical properties that are very useful in cryptanalysis, especially when the cryptanalysis is automated. The only time that these things don't help the cryptanalyst is in the ``one-time pad.'' III. ELEMENTS OF ENCRYPTION Several basic elements make up the basis of a multitude of possible encryption algorithms, either alone or in combination [BEK][BAL]. Although most of these elements may be used to form the basis of an encryption method all by itself, the most secure methods of encryption will several of them. For example, substitution and permutation are basic elements of both DES and MPJ encryption algorithms. These algorithms, in turn, are best used with some form of feedback. A. Substitution A substitution cipher simply substitutes one symbol from the plain text alphabet for another one from a cipher text alphabet. The plain text alphabet can be a natural language alphabet, ASCII, EBCDIC, Baudot, the set {0,1}, or any other finite set. Cipher text alphabets, likewise, may be any finite set. To be able to easily apply computer methods to the manipulation of these alphabets, it is preferable to use alphabets that are groups of binary digits (bits). This is not a severe limitation, since a correspondence can be set up between any finite set and a set of binary numbers. 1. Monoalphabetic A monoalphabetic substitution cipher is one in which each letter of the plain text alphabet is always substituted for by the same cipher text alphabet. The cipher text alphabet need not be in any particular order. A subset of the monoalphabetic substitution cipher is the Caesar cipher. This cipher replaces each letter with a letter that is n letters later in the alphabet. This cipher is so named because it was reportedly first used by Caesar. [KAH] With only 26 possible keys, this is obviously not very secure. In fact, any substitution cipher that substitutes one letter for another is vulnerable to solution by frequency analysis, and can be solved, given enough cipher text with the same key. How much cipher text is enough depends on the nature of the plain text, but for plain English, as many characters as there are letters in the alphabet is sufficient for the probabilities of occurrences of letters to betray enough letter identities to make solution probable. While ciphers of this type no longer have any value for serious cryptography, they do make fun puzzles for elementary school children. For example, the following cryptogram is presented for your solution: UIJT JT B DBFTBS TVCTUJUVUO DJQIFS/ XPVME ZPV USVTU VPVS TFDSFST UP JU@ Substitution ciphers can be made more secure by operating on substitutions of more than one letter at a time. For example, the substitution could be made on digraphs or trigraphs (groups of two or three letters). A more practical example is the Data Encryption Standard. DES is actually a substitution cipher with an effective alphabet size of 2^64, or about 1.84 x 10^19, and the MPJ Encryption Algorithm is a substitution cipher with an effective alphabet size of 2^128 or about 3.4 x 10^38. Not only is it difficult to get that large of a sample to analyze for statistics, but the memory required to store the substitution table is impractical even for computer systems using large optical storage disks. 2. Polyalphabetic A polyalphabetic substitution cipher is one in which each letter of the plain text alphabet may be replaced a letter from one of many cipher alphabets. The selection of which alphabet to take the substitution from may be determined in many ways. The most usual method is by selecting the cipher alphabet by a combination of a message indicator and the number of characters from the beginning of the message. The classic example of this type of cipher is the rotor-based machine. The substitution occurs by physical wire crossings within rotors. After each character is enciphered, one or more rotors are rotated by a ratchet mechanism. The message indicator, which is part of the key, determines the starting position of the rotors. This effectively allows the use of more cipher alphabets than there are letters in the message. Because of this, the use of statistical analysis on any one message to guess at the substitutions is useless. There is, however a good way to attack this kind of cipher when it is used heavily. This is by analyzing the statistics of the first character in each message, then the second, etc. This is called analyzing the messages ``in depth.'' Although the rotor-based polyalphabetic ciphers used by Germany and Japan in World War II contributed greatly to their losses because their messages were read regularly by the allies, it is possible to create a more secure polyalphabetic substitution with computer technology. The primary weakness of the mechanical rotor machines was the infrequent changing of many of the parts of the key such as rotor wiring and ratchet construction. A more general method of using a different set of alphabets for each message would be very secure. Unfortunately, it would result in keys larger than the message. Since the minimum key length for absolute, provable cryptographic security, as determined by C. E. Shannon [SHA], is exactly the length of the message, this solution is not very efficient with respect to key management. B. Permutation Permutation is taking the same symbols and writing them in a different order to confuse an interceptor. There are many ways of doing this geometrically. For example, words may be written vertically then transcribed horizontally, or they may be written on a strip of paper wrapped along the length of a rod of a given diameter, then unwrapped for delivery. In general, a permutation operation may be considered as an operation on a block of symbols, where each symbol's position is changed to another position within the block according to some rule. The blocks may overlap each other if the permutation is undone from the end of the message to the beginning. The rule(s) determining how the permutation is done may be fixed, or they may depend in some way on a key. Although historical permutation ciphers tended to be rather simple, computer methods allow them to be potentially rather complex, especially if the symbols that are scrambled are individual bits. Permutation, when used in combination with substitution, is very useful in further increasing the complexity of a cipher, since the two algorithms act in different ways to baffle the interceptor. This is significant, since it is not possible to actually increase the security of a substitution cipher by simply repeating the same thing with a different key. The result in such a case would be another substitution cipher with the same complexity. C. Noise Addition Noise addition can either be a way of enhancing another encryption scheme, or of hiding the very existence of a message. For example, a grid can be set up on a piece of paper where only certain positions are significant. These positions are then concatenated together to reveal the secret message. The rest of the paper is then filled with something that uses those characters, but contains something different, like a letter to someone's mother. The same thing can be done electronically, by defining a block of bits with only part of the bits significant. The rest of the bits are then filled with truly random noise, like the output of a Geiger counter. This method can be very effective, especially if the position of the information-bearing bits vary from block to block in a pseudorandom manner, and if there is a relatively large proportion of truly random noise. The obvious disadvantage to this kind of scheme is that it makes very poor use of communications channels and data storage facilities. One useful variation on the noise addition method is to multiplex several encrypted streams of data (encrypted with another method) together with some pseudorandom interleaving. Since the other encrypted streams of data act like the random noise added to any one stream of data, this is a good way to improve security when a large volume of encrypted data must be sent through the same channel. D. Feedback & Chaining A block cipher like DES or MPJ, when applied directly to an input file with a highly repetitive structure (i. e., lots of spaces between columns) will also display some of that structure. Although it may not be possible to determine the exact contents of what has been encrypted, it may be rather easy to determine something about the nature of what has been encrypted. To deny the interceptor even this information, and to further increase the complexity of the encrypted information, feedback and chaining may be used. Chaining refers to making the results of the encryption of one block dependent on previous results. This is usually done in one of two ways. One is by adding the plain text from the last block to the cipher text of this block, modulo two. The other is by adding the cipher text from the preceding block to the cipher text of the current block, modulo two. These are referred to as plain text feedback and cipher text feedback, respectively. 1. Plain Text Feedback Plain text feedback has the property that any errors in transmission get propagated clear through the rest of the message. This may be a desirable property if it is necessary to detect any tampering with the message, but it makes it difficult to use real (error-prone) channels. If it is necessary to have the property of error propagation to detect tampering and to use real channels, then the message should be transmitted using an error detection and correction protocol of some sort. 2. Cipher Text Feedback Cipher text feedback also causes some error propagation, but not clear through to the end of a message. An error in one block will cause an error in that block in the bit positions where the error occurred and in all of the next block, but the receiving station can recover all subsequent error-free blocks. This method is a good compromise between security and error recovery. It is still desirable to wrap encrypted data in an error detection and correction protocol to avoid having a one bit error wipe out many bits. E. Analog Encryption There are several methods of analog encryption. None of them are as secure as digital methods can be. Analog encryption is generally based on spectrum inversion, spectrum scrambling, time slice scrambling, and (for video signals) suppression of synchronization information. These elements are also used in combination [LOD]. These methods are commonly used by cable TV companies (and sometimes on satellite channels) to ensure that people only get the programming that they have paid for. Articles on how to build devices to defeat some of these things appear periodically in electronic hobbyist magazines. The state of the art in current use for protection of satellite TV signals is the General Instruments Video Cipher II. This device used digital (DES) encryption of the audio information and a partially analog encryption of the video signal. This device is a deterrent to those who receive satellite TV without paying a cable TV operator, but it has been broken by a hacker who figured out that it was possible to modify the microcode in the device so that a key purchased for one machine would work on all of them with modified code. This is, of course, illegal, but it allows one person to pay for a service and many people to use it for free [DOH]. The problem here is not one of the encryption algorithm, but in the key management. IV. FACTORS RELATED TO ENCRYPTION A. Change of Language Although it is not really encryption, the use of a different language from what the interceptor is likely to know does increase the difficulty of cryptanalysis. For example, I know that it would be far more difficult for me to solve even a simple Caesar substitution cipher in Chinese than in English, Spanish, or German. This principal was used effectively by the United States against Japan in World War II by using the Navajo language for spoken radio communications [KAH]. Since the Navajo language had no written form, and was only spoken by a small group of Native Americans, there were no Japanese who knew the language. The Japanese never did figure that ``code'' out. Thus, the Navajo ``Code Talkers'' made a great contribution to their country. Part of the reason for the success of the Navajo Code Talkers was that the Japanese had no idea what was going on. Since this success is now well known, it is difficult to determine the potential success of a repeat of this kind of approach. There are still some languages that are spoken only by a very few people in the world, some of which have no written form. These could be used to increase the effective ``key space.'' Although a change of language was a great victory for the Navajo People and the United States of America during World War II, it is difficult to adapt this success to computerized communication methods. B. Digitization The very act of digitizing an audio or video signal provides some protection from the casual eavesdropper. Although this provides no protection against a determined snooper, it is an appropriate level of additional security for protecting the privacy of things like mobile telephone conversations. Of course, once an analog signal is digitized, there is a wide range of digital encryption methods available. Of course, some methods would be overkill, since the same communication would probably be transmitted in the clear over wire and/or microwave channels as well as by radio between the car and the stationary cellular transceiver. Although there has been some ill-conceived efforts by the mobile telephone industry to legislate privacy for such applications by saying that it is illegal to listen to someone else's phone calls around 800 MHz (where cellular phones usually operate), this kind of thing is unenforceable. Worse yet, such legislation provides a false sense of security in a world where scanners and other receivers that operate in that range are readily available. C. Compression Data compression itself tends to obscure things by taking, for example, and ASCII text file in English and reducing it to a binary file that is not as easy to read. There is a more important effect of data compression, however. The removal of the natural redundancy of the plain text data before encrypting it with some kind of encryption algorithm makes it much more difficult to cryptanalyze the cipher text. In the extreme case where all of the redundancy of a message is removed, all messages of a given length would be meaningful, and it would be impossible for the cryptanalyst to determine which one was intended. D. Multiplexing Multiplexed signals are less readable to the casual observer, but standard multiplexing offers no real increase in cryptographic strength. It can, however, increase the strength of multiple streams of encrypted data when the multiplexing is done with a pseudorandom interleave (as discussed above under noise addition). This makes the multiplexing and demultiplexing processes more complex with respect to synchronization and timing, and is likely to introduce more delay into the system than more straight forward multiplexing arrangements. V. COMPARISON OF SELECTED ALGORITHMS A. One-Time Key Tape The one-time key tape, also called the one-time pad, has a tremendous advantage. It is the only commonly used encryption method that is provably secure. Proving the security of most other systems generally reduces to an impossible negative proof <197> an attempt to prove the lack of a method to defeat it. The way the one-time key tape works is to add each character of the message to a character of the key modulo the alphabet size. When working with any binary stream of data, this is easy to implement by exclusive-oring the input stream with the key stream. At the receiving end, the same thing is done to decrypt the data. Since P + K + K = P (modulo 2), the original stream is recovered. As implied by the name, it is important that each key be used only once. If it were used more than once, then the system would be vulnerable to attack with the known cipher text and corresponding plain text attack. The key is obtained from the boolean identity: K = (P + K) + P, where K is the key, P is the plain text, P + K is the cipher text, and all additions are modulo the alphabet size (usually 2). The key thus recovered would then be used to decipher the next message that used it. According to C. E. Shannon [SHA], for a cipher to be provably secure, the number of keys must be as great as the number of potential messages. The number of keys he refers to, of course, is the number of keys that yields a distinctly different transformation of the message text into cipher text. This method of encryption is provably secure, since an exhaustive search of the keys applied to any cipher text of a given length will yield all possible plain text messages of the same length. Since there are many messages of the same length that make sense, many of which have totally contradictory or unrelated meanings, the cryptanalyst has no way of knowing which one was intended unless he has the key. Naturally, any method of encryption this secure has a price. The price is the sheer volume of keying material that must be kept secure and managed. For any communication network or data storage system that handles large amounts of data, key management becomes a nightmare with this system. B. Linear Shift Register Feedback Linear shift registers with selected feedback taps are commonly used to generate pseudorandom sequences for such things as spread spectrum communications. They could be (and are, in some cases) applied to cryptography by exclusive-oring the pseudorandom stream with the plain text stream. The main weakness to this approach is the cipher text with corresponding plain text attack. If the cryptanalyst obtains the plain text that came from a certain cipher text, then he can recreate a portion of the pseudorandom stream by exclusive-oring the two together. The linear shift register feedback function can then be expressed as a system of linear boolean equations that will solve the portion of the cycle that was received. A solution of this system of equations is possible with only enough bits to be a few times as long as the shift register at the most [DEN]. Since this amount of plain text required to break the cipher is so much less than the length of the pseudorandom sequence, it is likely that this solution can then be applied to additional cipher text to recover it, too. C. Exponential Encryption The classic algorithm of this type is the RSA algorithm, which is named after the initials of its creators (R. L. Rivest, A. Shamir, and L. Aldeman). The security of this algorithm rests on the fact that even with supercomputers, it is very difficult to factor the product of two very large prime numbers. The RSA algorithm has a unique property in that the key has two parts, one of which may be made public. This makes this algorithm very well suited to the use of digital signatures. The RSA algorithm defines the relationships between the plain text message, the cipher text, and the elements of the key as follows [JAC]: C = P^e mod m P = C^d mod m m = pq where C = Cipher text, P = Plain text, e = encryption key, d = private decryption key, m = public modulus, and p and q are randomly chosen large (> 150 digits each) prime numbers. The receiver chooses a private key d and calculates a public key e. d must be relatively prime to (p - 1) and (q - 1). The algorithm works because ed = 1 mod the least common multiple of (p - 1)(q - 1). The algorithm is only secure if very large prime numbers are used. A number of 100 digits is too small - this size of number can be factored now with a multiprocessor technique developed by Arjen K. Lenstra of the University of Chicago and Mark Manasse of Digital Equipment Corporation [COS]. The key generation for RSA is a bit nasty, but is reasonable using Rabin's test for primality: Let n = 2^rd+1, where d is odd. Choose a at random from 1 < a < n - 1. Accept n as prime if either a^d = 1 mod n or a^2jd = -1 mod n for some j such that 0 <= j < r, otherwise reject it. RSA is excellent for public key and authentication use, but it does have certain disadvantages for general encryption use. The processing time required for key generation and for the encryption/decryption process makes it a poor choice for use with high data rates, although there are some reasonably efficient ways to do the exponentiation [CHI]. The complexity of the algorithm makes it more difficult to implement than many others [VAN]. The worst problem, though is the way that the complexity of solving the algorithm could be reduced by several orders of magnitude by mathematical research. For example, J. Sattler and C. P. Schnorr recently published some algorithms that would appear to reduce the complexity of solving an RSA key by a factor of about 10^4 [SAT]. Granted that we are talking about taking 9 x 10^8 years instead of 1.5 x 10^13 years, but it is the potential for other discoveries that may drastically reduce this time even further that is the real threat. D. Knapsack A knapsack or trapdoor algorithm is one that is relatively easy to perform, but which is very difficult to invert. The RSA algorithm is one such algorithm. Another one, proposed by Ralph C. Merkle, uses the following vector operation: given an integer s and an integer vector a = a1, a2, ..., an find a vector k = x1, x2, ..., xn where xi is in {0,1} such that s = x * a. (* denotes dot product.) [MER] This algorithm is referred to in another paper as having been broken [VAN]. Another algorithm using matrix operations is suggested by H. Retkin [RET]. This algorithm may be good, but it is not obvious to me that no one will come up with a good solution to breaking it, too. Therefore, I prefer to use encryption algorithms that have simple brute force solutions that are well understood, but just take too long to perform to be of concern. E. Rotor Machines Rotor machines were once the state of the art in cryptography. They are a way to form a polyalphabetic solution automatically. An excellent history of these machines is contained in [KAH], [DEA], and [KOZ]. Although they are obsolete now due to the superiority of many computer algorithms, they played a major role in the history of cryptography. There were many different variations on the rotor machine, but the main principle of each of them was that the input, which is normally a indicated by one of 27 (or whatever the alphabet size in use is) wires is connected to a voltage level by a keyboard. This wire is then connected via rotary contacts to a rotor that physically permutes the position of the wire. This causes a simple substitution for each character, or a poly alphabetic substitution. The output is taken from rotary contacts on the other side of the rotor. Several of these rotors are placed in series. After each character is encrypted, one or more of the rotors is moved to a new position by a ratchet mechanism. The net effect is a different simple substitution for each letter in the message. Common additions to this scheme are a plugboard style permutation performed before or after the rotors and the addition of a reflecting rotor to run the signal back through the rotors in the opposite direction, using each rotor twice. The key used is a combination of several things: (1) the permutation performed by each rotor (changed by rewiring it), (2) the order in which the rotors were assembled, (3) the starting position of the rotors, (4) the plugboard connections, (4) the number of rotors used, and (5) the ratchet arrangement. Taken all together, these form a large key space. If the entire key space is varied simultaneously and for every message, then this polyalphabetic substitution is very secure. Due to the construction of these machines, some of the variables were not as easy to change as others. Therefore in practical use, the variables were not changed all at once. For example, the number of rotors and the ratchet arrangement would probably be fixed for the life of the machine. Therefore, if this portion of the key is solved for once, that is enough until a new kind of machine replaces it, perhaps years later. The rotor wiring is tedious to change, and therefore unlikely to be changed very often. The order of the rotors might be changed periodically, but during heavy use, the only thing that was changed for each message is usually the starting position of the rotors. Using an alphabet size of 26 and five rotors, varying only the starting position of the rotors provides a key space of less than 12 million, which is within the range of possibility of solution by mechanical computer, and a quick task for one of today's PCs. With many messages encrypted with the same rotors, the rotor wiring can be solved by frequency analysis ``in depth.'' In other words, the permutation applied to the first letter of each message can be solved by determining the frequency of occurrence of each code letter and comparing that with the language of the source language. This can be done for each position in the message. From these substitutions, the rotor wiring and plugboard connections may be reduced. This is a more lengthy process than solving for the rotor starting position, but once done the solution is likely to be good for a while. Using this kind of approach, isolating the parts of the key, the Allies read many German and Japanese messages during World War II. Rotor machines can be simulated in a computer program with more flexibility than in mechanical hardware, with many improvements. There are, however, better methods to use in computer algorithms that were not easy to use in mechanical devices. Therefore, rotor machines are interesting to study for historical value, but they are obsolete at a time when computers are becoming almost as common as televisions. F. Codes Codes, as opposed to ciphers, operate on linguistic units like words, phrases, and sentences, rather than directly on the units of the alphabet. Codes have the advantage that they generally compress the message. For example, the codeword ``APPLE'' might mean ``The supplies will be shipped on Monday via private courier'' and ``GRAPE'' might mean ``The bid is too high. Reduce it by 10%.'' If there are only a few different messages that might be encoded, a substantial savings in space can be realized. Codes are useful for compression of information, even when no encryption is intended. For example, the Uniform Commercial Code, used for telegraphy saves on tolls, even though it is published and gives no real security. Another example is the use of Q signals on amateur radio, where encryption is illegal but brevity is important, especially when using a slower mode of communication like Morse Code. There QST means ``The following is a message of interest to all Amateurs'' and QSL means ``My location is _____.'' When used to hide the meaning of the message, the code must be changed often. This is much more difficult to do than to change the key to an encryption algorithm. Therefore, this technique has limited value except for some diplomatic and military codes. For commercial applications where brevity and security are both required, it is easier to first encode the plain text with a fixed code to reduce its size, then to encrypt the encoded text with an encryption algorithm whose key can be easily changed. This technique is more secure than the use of the encryption algorithm alone, even if the code used is widely known. See chapter VIII for more on the effects of data compression when used with encryption. G. Galois Field and Hill Cryptosystems A Galois Field cryptosystem is one in which the letters of the encryption alphabet are assigned arbitrarily to the integers from 0 to p^n, where p is prime and n is an integer. These integers are then operated on in blocks of m letters by matrix multiplication with a matrix that represents monic irreducible polynomials of order m - 1. All operations are done modulo p^n. To decrypt the data, the ciphertext is multiplied in the same manner by the inverse of the encryption matrix (obtained using the same modulo arithmetic). The resulting numbers are then substituted back for the letters they represent. The Hill cryptosystem is similar, except that n is fixed at one, and that p need not be absolutely prime as long as several integers less than p are relatively prime to it. For more details on how these cryptosystems work, see the articles by Hill, Cooper, and Patti [HIL] [HLL] [COO] [PAT]. Both of these systems are based on linear algebra over a limited field of integers. Therefore, they are subject to solution of the key in use by linear algebra if enough cipher text and corresponding plain text is available. If the permutation of the encryption alphabet in use is known, then the amount of corresponding plain and cipher text needed for a solution is merely the key length. If the permutation of the encryption alphabet in use is not known then there must be a sufficient quantity to also perform statistical analysis attacks on the alphabet in use. Therefore, I do not recommend the use of these systems for serious encryption because of these vulnerabilities. Not only are they less secure than their key length (the combined length of the coefficients of their matrices) would indicate, they are computationally less efficient than DES and MPJ. Key generation for these systems becomes increasingly complex as the size of the key grows, too, because the probability of coming up with a valid set of monic irreducible polynomials that form an invertible matrix decreases rapidly with matrix size. These cryptosystems, do, however, provide interesting mathematical illustrations on what can be done with Galois Fields. They also provide adequate security for cases where the information being secured is not of great commercial value, and is therefore unlikely to be cryptanalyzed. H. DES The National Bureau of Standards (now known as NITS) Data Encryption Standard (DES) was published in the Federal Information Processing Standards Publication Number 46, dated January 15, 1977 (FIPS PUB 46) [DES]. For details, please refer to that publication. A summary of the algorithm follows. Encryption consists of an initial permutation, sixteen rounds of encryption, then an inverse of the initial permutation. Each of the sixteen rounds of encryption consist of taking the right half of the input block (32 of the 64 bits) and running it through a nonlinear function of the 32 bits and an internal key, then adding this result to the left half of the input modulo two. This 32 bit answer becomes the next round's right half block. The next round's left half becomes the right half block without modification. The nonlinear function used consists of a bit selection E that selects 48 bits from the input of 32 (several of the bits are repeated). These 48 bits are added modulo 2 to the round key of 48 bits. The result of that operation are then fed six bits each into eight substitution boxes. Each of the eight substitution boxes are different, but the same set of eight boxes are used for each round. Each substitution box gives an output of 4 bits. The output of these boxes are fed into a permutation P that rearranges the output in a fixed manner. The sixteen internal keys are generated from the 56 bit input key by feeding the input key into a fixed permutation that rearranges the order of the key bits. The key is then split into left and right halves called C and D. Each half is shifted left one or two times (according to a fixed table) before generating each internal key. Each of the sixteen internal keys is generated by taking the two halves of the key as shifted and permuting them in a fixed manner. The key and the resulting internal keys are the only things that vary in this algorithm. The initial and final permutations and the contents of each of the substitution boxes are constant. The two permutations used in generating the internal keys are constant. The bit selection and permutation used within the nonlinear function are constant. The strengths of the DES is that its cryptographic strength depends only on the key, that the algorithm is easy to implement in a single IC, that it has been well tested an no one has publicly announced a solution, that hardware and software that uses it is readily available, and that the algorithm places very few restrictions on key generation so that random numbers may be generated by the users for use as keys. The weaknesses of the DES are that the key is too short for security in the face of anticipated increases of computing power, that it is old enough that it is likely that someone has broken it (found a short-cut solution), that hardware implementations of the DES are too slow for some applications, and that it limits itself to be simpler than is really necessary with current technology. VI. DESIGN CONSIDERATIONS FOR MPJ The following paragraphs describe the design considerations that I used in creating the MPJ Encryption Algorithm, and the reasons for each. A. Strength Based on Key The strength of the system must rely on the security of the key only. It cannot depend on the algorithm being kept secret, because the algorithm will be published. Even if the algorithm were not published, it would probably be reverse engineered from software implementations of the algorithm. The algorithm must be constructed in such a way that there is no computationally feasible way to derive the key from samples of corresponding plain text and cipher text. B. Usability of Random Keys The key selection should be as easy as the random selection of a number in a given range. Selecting a very secure key should be no more difficult than flipping a coin once for each bit of the key, or generating keys using a pseudorandom sequence combined with random events such as timing of keystrokes on a computer. A one bit change in the key should provide a drastically different transformation, so that a potential cryptanalyst has no idea when a key that he guesses might be ``close'' to the right one. C. Key Length & Block Size The key length should be significantly longer than the DES 56 bit size. A key size of 128 bits (sixteen 8-bit bytes) was chosen as being very manageable, yet highly secure even when attacked by multiple array supercomputers. The block size was also chosen as 128 bits (twice the size of DES) to provide a significant increase in complexity of the encryption. D. Effort Required to Break The effort required to break the algorithm by any method should be so great as to make such a task unfeasible even if significant advances are made in computer technology. This requirement is intimately linked to the choice for key size and block size. For example, if one thousand keys could be tried every nanosecond (10^12 attempts per second), an exhaustive key search that resulted in success after only testing one millionth of the possible keys (extremely good luck) would take 2^128/((10^12)(10^6)) = 3.4 x 10^20 seconds = 1 x 10^13 years. If someone figured out a computational method or weakness in the algorithm that reduced the complexity by a factor of a million, and that someone also figured out how to compute the solution a thousand times faster, and they still believed in one in a million chances at good luck, they could possibly come up with a solution in only 10,000 years. By way of contrast, testing keys at the rate of 10^12 per second for DES with its 56 bit key, all of the keys can be tested in 7.2 x 10^4 seconds, or about 20 hours. While trying a trillion keys a second is not realistic with current general purpose computers, current technology could be applied to dedicated, highly parallel encryption/decryption engines that could approach that speed - at a cost that would be prohibitive for all but large governments at today's prices. Other than time, the other main measure of complexity of an algorithm is the storage capacity required to implement a lookup table attack. Suppose a lookup table were to be constructed that contained the encrypted version of just one block of cipher text corresponding to a very common block of plain text (such as 16 ASCII spaces) for every key possible. For the MPJ algorithm, this would take 2^128 x 128 = 4.356 x 10^40 bits. If this memory were constructed with some kind of device that could store one bit per atom of silicon, this would take 4.356 x 10^40 bits x 28.086 amu/atom x 1.660531 x 10^-24 grams/amu x 10^-6 metric tons/gram = 2.03 x 10^12 metric tons of silicon. That is about one thousandth of the mass of Deimos, the smaller of Mars' two moons. [CRC] For DES, the same table would only require about 215 micrograms of silicon. Nobody has come up with memory that dense, and fundamental physical limits make such a task difficult, indeed. It is not difficult to conceive of some breakthroughs in optical storage technology coming close, say using a thousand molecules of something per bit. Under those circumstances, DES would be vulnerable to such an attack, where MPJ would still be safe. E. Computational Efficiency The MPJ encryption algorithm must be computationally efficient enough to be implemented in software on a standard IBM PC or compatible (or on an Apple computer of comparable power), and fast enough to handle at least 10 megabits per second when implemented in dedicated hardware. Note that this is less restrictive with respect to the hardware for DES, which was required to be simple enough to implement on a single chip using 1970s technology. F. Communication Channel Efficiency The MPJ encryption algorithm must not significantly increase the size of the plain text when encrypting it. This precludes the use of noise addition as a technique to be used. G. No Back Doors or Spare Keys While it may be impossible to guarantee that no ``back doors'' or ways to decipher a message without the key exist, the algorithm should be sufficiently complex combination of simple, well-understood operations that no help is offered to the cryptanalyst from the structure of the algorithm. Spare keys (the situation where more than one unique key will decipher a message) are avoided by making the number of keys possible much less than the number of possible transformations that can be done on a set of blocks. This is true for MPJ because 2^128 << 2^128!. VII. MPJ ENCRYPTION ALGORITHM A. Description The MPJ Encryption Algorithm can be looked at from the outside like a rather large electronic codebook that operates on 128 bit blocks. The algorithm may be used in several modes. It can be used directly in electronic codebook mode, in block chaining with ciphertext feedback mode, in block chaining with plain text feedback mode, and in stream cipher generation mode. Given a 128 bit key, instructions to encrypt or decrypt, and a 128 bit input block, a unique 128 bit output block comes out. Encryption and decryption are inverse operations that can be done in either order. For example, if P is the plain text block, C is the cipher text block, EK1 is encryption with key 1, and DK1 is decryption with key 1, then C = EK1(P); P = DK1(P). A different cipher text results if the plain text is decrypted first, but DK1(EK1(P)) = P = EK1(DK1(P)). Multiple encryption can be done with several keys, as in these examples with three: C = EK1(EK2(EK3(P))); P = DK1(DK2(DK3(C))) or a different method: C1 = EK1(DK2(EK3(P))); P = DK1(EK2(DK3(C))) 1. Overall Structure of MPJ Before encryption or decryption occurs, the substitution boxes are filled based on the input key. To encrypt a 128 bit input block, it is run sequentially through ten rounds of substitution. Each round of substitution operates on the 16 eight-bit bytes that make up the input block individually. In between each round of substitution is a round of permutation, for a total of 9 rounds of permutation. The permutation operation is identical each time. Each operation (substitution and permutation) has an inverse operation. Decryption is done by running the inverse operation of each of the above steps in the opposite order. After one round of permutation, each bit in the output block depends on every bit in the byte that it is in and on eight bits of the key. After a round of permutation and the second round of substitution, each bit in the block depends on eight bytes of the input block and on eight bytes in the key. After the third round of substitution, each bit in the block is functionally dependent on 15 of the input bytes and on 15 bytes of the key. After the fourth round of substitution, each bit in the block is functionally dependent on every bit of the input block and on each bit of the key. After the tenth round of substitution the functional dependence on every bit of the key and the input block is so complex that it would be infeasable to determine the key used, even if the cryptanalyst has known corresponding plain and cipher text and knows the algorithm used. Because the substitution boxes are totally nonlinear functions, the problem of finding each of the 40,960 internal key bits is a tough one, indeed. It would be easier to find the original key of only 128 bits by brute force - something that is difficult, indeed. 2. Substitution Boxes The substitution boxes are designed to be big enough to provide a large number of possible arrangements, but small enough to still fit comfortably within the memory of an MS-DOS computer. Each substitution box is therefore a collection of 16 substitution boxes that operate on only 8 bits at a time, with a total of 256 entries. The substitution boxes may be an array or look-up table in a computer, or they may be implemented in hardware using RAM. The substitution boxes are filled during the internal key generation process with a permutation of numbers that depends on the main key in a different way for each substitution box. 3. Wire Crossings The wire crossings serve to extend the functional dependence of the output block on the input block across the internal byte boundaries. This is done by making each byte of the output of the permutation to contain one bit from each of eight different bytes. The selection of bits was chosen to make a computer implementation fairly efficient. A pure hardware implementation of this operation is, of course, much faster, since it involves only the propagation delay of the signal from one end of a short wire to the other. The input block bit positions are numbered from left to right (MSB to LSB) as 127 down to 0, then the new positions those bits occupy after the wire crossing are (MSB to LSB): 55, 46, 37, 28, 19, 10, 1, 120, 111, 102, 93, 84, 75, 66, 57, 48, 39, 30, 21, 12, 3, 122, 113, 104, 95, 86, 77, 68, 59, 50, 41, 32, 23, 14, 5, 124, 115, 106, 97, 88, 79, 70, 61, 52, 43, 34, 25, 16, 7, 126, 117, 108, 99, 90, 81, 72, 63, 54, 45, 36, 27, 18, 9, 0 4. Key Generation Internal key generation in the MPJ encryption algorithm consists of filling all of the substitution boxes (arrays in software or RAM in hardware). There are 16 substitution boxes used for each of 10 rounds. Each substitution box contains 256 entries of 8 bits each. Therefore, the actual internal keys have a combined length of 16 x 10 x 256 x 8 = 327,680 bits. These are obtained by manipulation of the 128 input key bits. Note that trying to attack the MPJ encryption algorithm by brute force trial and error using internal keys instead of using the 128 input key bits is ridiculous. The calculations given above indicate how difficult even a 128 bit random key is to solve by brute force. The only way that an attack on internal keys is useful is if there are parts of the internal keys that can be solved for separately, without having to solve for as many as 128 bits at once. Attacks against the internal key structure are made computationally intractable in three ways. First, the smallest chunk of the internal key that could be meaningfully solved for is the contents of one of the substitution boxes. Each of these contain 256 bytes, or 16 times the length of the input key. Remember that the addition of each bit doubles the complexity of the solution, so this is not attractive compared to solving for the input key. The second impediment to this attack is to make the relationship between the contents of the substitution boxes and the input key quite nonlinear. The way this is done is by using a combination of rotating bit selection and the use of the last substitution box filled to compute the contents of the next one. The third line of defense against the solution of internal keys is the use of ten nonlinear rounds of encryption to make any attempt at constructing a system of linear equations to solve for the internal key given known plain text and corresponding cipher text infeasable. Note that each round of encryption in MPJ causes the whole 128 bit block to be changed, where each of the rounds in DES only modifies half of its 64 bit block. Therefore the internal structure of MPJ is very difficult to solve for, with each bit of the output having a nonlinear functional dependence on every bit of the input and on the contents of 1 + 8 + 15 + (7 x 16) = 136 of the substitution boxes, containing a total of 34,816 bits. If the substitution boxes were linear, this solution would be possible, but there is no method I know of to solve such a system of nonlinear equations, either here or the simpler case of the DES internal structure. To find the inverse of the substitution box functions, separate inverse substitution boxes are formed that place each address of a substitution box at the address of the data contained in the substitution box. For example, if the contents of the first substitution box in the first round of encryption for the input value of 1 is 219, then the contents of the first inverse substitution box in the last round of decryption for the input of 219 will be 1. In equations, SI[i, j, S[i, j, k]] = k for all i, j, k, where the first index of the array is the round of encryption, numbered from 0 to 9, or the round of decryption, numbered from 9 to 0 (defined differently for programming convenience); the second index is the position of the 8 bits that the substitution applies to in the input block, numbered from 0 to 15, left to right; and the third index is the value of the 8 bits input to the block. The actual algorithm for substitution box filling consists of three main parts. The first one is a nested loop to fill one substitution box, permute the key with the same permutation used for the encryption operation, then to run each byte of the key through the substitution box that was just filled. This is done for each round of encryption, from 0 to 9, with each substitution box used in one round of encryption, from left to right, done within each of these rounds. In other words, the index of the three dimensional array that selects the position of the substitution box within each array varies faster than the index that selects the number of the round of encryption that the box uses. The second algorithm is the one that fills each substitution box based on the contents of the input key, as modified by the previous steps. This algorithm uses bit selection and rotation to determine where each value of the output of the substitution box goes. To be an invertable function, each of the possible output values from 0 to 255 must appear exactly once somewhere within the array with 256 possible slots, numbered from 0 to 255. The first number placed in the array is 255, and there are 256 possible places to put it. The second one is 254, with 255 places to put it. This is continued until the last value, 0 is placed in the one remaining slot. For the first half of the values, the formula pos = (n * m) div 255 is used to determine which of the unfilled slots (counting in index order from 0 to n) is to be used to contain n. The variable pos is the position (counting only unfilled slots), * denotes multiplication, and div is the integer division operator (remainders are ignored). The variable m is determined for the first value of n (255) by selecting one bit from each byte of the key, starting with the least significant bit from the least significant byte of the key. The next bit (place value of 2) is selected from the next most significant bit of the key, and so on until m has eight bits. For the next value of n (254), the same thing is done, except that the bit selected is one bit position to the left of the one selected last time. The place value of the bit selected is the same in m as it is in the byte it came from. The bit to the left of the MSB in a byte is the LSB of the byte to the left of it. The byte to the left of the most significant byte is the least significant byte. For the values of n from 127 to 0, then the same thing is done, except that only seven bits are selected for m, and the position is determined from pos = (n * m) div 127. For a more concise explanation of this process, see the commented pascal procedure makesbox in the program in appendix A. The third algorithm used in internal key generation is the generation of inverse substitution boxes. This is done with a simple nested loop that fills the inverse substitution boxes according to the formula si[i, j, s[i, j, k]] for i from 0 to 9, j from 0 to 15, and k from 0 to 255. The array si is the collection of inverse substitution boxes, and the array s is the collection of substitution boxes filled by the above two algorithms. Filling of inverse substitution boxes is only needed for decryption mode. B. Implementation in Pascal Pascal source code for a program to implement the MPJ Encryption Algorithm is given in Appendix A. 1. Exceptions from Standard Pascal This program was written for MS-DOS computers and compiled using Borland Turbo Pascal 5.0. To adapt this program for use on another system, note that the following features of Turbo Pascal that do not conform to the ANSI/IEEE770X3.97-1983 Standard Pascal were used in this program: (i) The assign procedure is used to associate an operating system file name with a Pascal file name. (ii) The nonstandard file handling procedures blockread, blockwrite, close, and seek, were used. The nonstandard function filepos was used. (iii) The type longint (a 32 bit integer) was used for input handling to allow range checking without invoking a system error for a number that was only slightly out of range. A 16 bit integer could be used instead. (iv) A generic file type was used, which is not defined in ANSI Pascal. (v) The nonstandard operators shl and xor were used. The shl could be replaced with integer division by 2 (div 2), and the xor could be replaced by expressions of the form ((A and (not B)) or ((not A) and B)). (vi) Logical operators (and, or, not, xor) were used with integers to perform bit-wise operations. (vii) The + operator was used to concatenate strings. (viii) The uses clause links in separately compiled units. (ix) The include comment {$I filename} includes additional source code to be compiled with the current file. (v) Additional nonstandard features were used in writing startup and the file handling functions that are unique to MS-DOS. These functions would best be rewritten for a different target system. 2. Main Program The main program calls startup to initialize variables, get the key from the user, and determine which files are to be encrypted or decrypted. It then calls makesbox to generate the internal keys (fill the substitution boxes). If the decryption mode is to be used, it calls makesi to fill the inverse substitution boxes. It then does the actual encryption or decryption. The main program allows for the encryption of multiple files (specified using MS-DOS wildcards), and passes these files to the encryption routine one block at a time. To provide additional security, files are encrypted in place, with each block of output overwriting the corresponding input block in the same place on the disk. If it is desired to keep a copy of both encrypted and plain text versions of the file, the input file should be copied before running this program. Because this program uses block chaining with ciphertext feedback, only the encryption mode of the MPJ algorithm is used, the source of the feedback (input or output) being determined by the decryption boolean variable. This mode is used because (1) it obscures repetitive patterns in the source file that the electronic codebook mode might not hide, (2) it accommodates files of any number of bytes (including a short block at the end) with no complications and results in an output file that is the same length as the input file, and (3) it is slightly faster in operation because the procedure makesi does not have to be called. This mode does require an initialization vector, but since the initialization vector is encrypted before use, a constant initialization vector may be used. To use the electronic codebook mode, delete the for statement following the calls to encrypt, replace one call to encrypt with a call to decrypt, and remove the comment delimiters from around the call to makesi. 3. Procedures Encrypt & Decrypt These procedures directly reflect the overall structure of MPJ. The are simply calls to the routines that perform the substitutions and permutations, calling them in the proper order and passing a parameter to the substitution routines that select the right set of substitution (or inverse substitution) boxes for the operation being done. The procedure encrypt makes 10 calls to substitute, with 9 calls to permute (one call to permute between each pair of calls to substitute. The procedure decrypt makes 10 calls to isubst, with 9 calls to ipermute, done in the inverse order of encryption. 4. Procedures Permute & Ipermute Permute just scrambles the order of the bits in the 128 bit block in such a way as to maximize the dependence of each byte on every other byte of the input. Each byte is formed by selecting the least significant bit from the least significant bit of the corresponding input byte, then the next most significant bit from the next byte to the left, and so on. The least significant byte is considered to be to the left of the most significant byte. Ipermute is just the inverse of this permutation, with bytes being selected to the right instead of the left. 5. Procedures Substitute & Isubst These procedures are simple array lookups in which each byte of the input block is used as the third index of the array s for the procedure substitute or si for the procedure isubst. The output is the contents of the array. The first index is the round of encryption or decryption (counting backwards if it is decryption), and the second index of the array is the byte position (0 to 15) within the input and output blocks. C. Implementation in Hardware For use with high data rates (greater than 10 Megabits per second), it is desirable to implement the MPJ algorithm directly in hardware. This also provides the advantage of greater security against tampering than a program on a general purpose computer might be subject to. The hardware implementation consists of six basic elements: (1) Key generation, (2) serial to parallel conversion, (3) substitution, (4) wire crossings, (5) parallel to serial conversion, and (6) timing control. The key generation is done by a program running in a microprocessor. This program takes the key as input and fills the substitution boxes, which are simply static RAM chips. Serial to parallel conversion and parallel to serial conversion is done with shift registers. The wire crossings are done with physical wire crossings (or printed circuit board trace crossings) - a technique that is hard to beat for speed. Timing control is done with a sequential circuit that is synchronized to the incoming data stream. The output data stream will, of necessity, be delayed by at least the block size of 128 bits, and some kind of protocol must be used to synchronize the blocks at the sending and receiving ends. The following diagram shows a block diagram of a hardware implementation of the MPJ Encryption Algorithm. There are, of course, many variations possible. For example, block chaining could be used with hardware exclusive-or gates to do the modulo 2 addition and flip-flops for unit delays. D. Strengths & Weaknesses The main strengths of the MPJ encryption algorithm are: It is much more secure than DES. When used in conjunction with data compression, MPJ is probably more secure than many of the best military and diplomatic codes & ciphers. It is easy to generate a key for MPJ. It is capable of high data rates when implemented in hardware. It will run on a personal computer. It takes more time to generate internal keys from an external key than to actually encrypt or decrypt data, making exhaustive key search attack more difficult than normal use. The algorithm is published and may be freely used for any legal purpose. There has not been a lot of time for anyone to search for a short-cut solution to MPJ, nor the volume of sensitive material to motivate such a search as DES has had. The main weaknesses of the MPJ encryption algorithm are: The key and block sizes are fixed (optimized for the memory and processing power of the IBM PC), so it doesn't take very good advantage of more powerful computers other than running faster. The software implementation isn't very fast. There is no way to prove that there is no short-cut solution to MPJ (or DES, or almost any encryption algorithm except for the one-time key tape). Since the MPJ encryption algorithm is my own invention, I could possibly be blind to some weakness that it might contain. VIII. DATA COMPRESSION A. Purpose Data compression has two obvious purposes: to save on communication channel capacity required to send a message and to save on media capacity required to store a message. The third purpose for data compression, pointed out by C. E. Shannon, is to make decryption of a message more difficult [SHA]. In reading through the history of cryptanalysis, it became very obvious to me that the one thing that all cryptanalysis relies on the most is the presence of a great deal of redundancy present in natural languages. This is especially true for messages that the cryptanalyst already has something of an idea of what the message is about. It is this redundancy that enables the cryptanalyst to tell which of several possible solutions is the right one - the one that ``makes sense.'' For example, if the message is between English-speaking parties, then the of the possible messages ``ACCEPT THE PROPOSAL,'' ``A;LSDI FUPE ZAARED,'' and ``XXDKFJ DDDLKJDJFFZA,'' the first message is instantly recognized as the only one that makes sense. The reason that this is so obvious is because of the redundancy of natural languages. If all of the redundancy of a message is removed, then there is no criteria to select one possible answer over another, and the cryptanalyst is left baffled. The task at hand is to try to eliminate or at least reduce the amount of redundancy in a message before encryption, then restore the message to its original meaning after decryption. If all of the redundancy of a message is removed, it means that all possible messages that can be constructed from an alphabet have meaning, and that the length of a message is inversely proportional to its probability. This ideal can probably be achieved only for special circumstances with very limited messages. For use with general correspondence in a natural language, however, it is not practical to totally remove the redundancy from all messages. In fact, it is very difficult to even measure redundancy or entropy in a natural language. The concepts of redundancy and entropy (as defined in communication theory texts) are useful for comparisons of various methods of redundancy reduction. Most existing methods of redundancy reduction use either a manually constructed code like the amateur radio operator's Q-signals or an automatic method that operates on either the alphabet or some groupings of alphabet symbols, like trigraphs. Manual techniques are fine for their intended application, but automatic methods that are well-suited for computer implementation are better for use with encryption. Automatic methods are likely to achieve good results with much greater ease of use and less opportunities for errors. The probability of occurrence of various letters in most natural languages are fairly constant over a wide range of types of text, and have been studied and published many times. These probabilities for English letters are published in several of the texts on cryptography listed in the references section of this thesis. Probabilities of digraphs (groups of two letters) and trigraphs (groups of three letters) have likewise been studied, and are fairly constant for a given language. Just changing the representation of trigraphs to variable length codes that are shorter for more common trigraphs and longer for less common trigraphs results in a substantial savings in length of an English message. There are many such methods for compressing data that are fairly straight forward. This and other methods, many of which are discussed by James Storer in his book on data compression [STO]. There are also several public domain and share-ware programs that are available on many bulletin boards that perform data compression. My favorite is PKARC, written by Phil Katz. It dynamically analyzes each input file to determine which one of five compression methods or storing with no compression results in the smallest file. The compression methods used by PKARC are repetition coding, Huffman encoding, Ziv-Lempel-Welch compression, and two different variations on Dynamic Ziv-Lempel-Welch compression [KAT]. PKARC also has an option that allows a rather crude encryption of the archive files with a password. Although this doesn't provide strong cryptographic security, it does help to obscure the redundancy that PKARC adds back in, in the form of cyclic redundancy checks on files and file headers. The CRC checks and file headers would be a good point of vulnerability when trying to cryptanalyze an archived (compressed) file. My feeling was that it was possible to compress natural language text even more than was commonly being done by using linguistic parsing of the text to form a dictionary. I was partially right. My method does remove more redundancy and make most text files smaller and more secure. It won't win any speed contests with PKARC, however, which I recommend that you use if speed is of great importance or if you are going to compress binary files (such as executable programs). In comparing the performance of SQUEEZE with PKARC, I compressed a 4,316,030 byte ASCII text file (the King James Version of the Holy Bible) to 1,312,493 bytes with a language code file of 163,236 bytes. PKARC compressed the same file to 1,779,184 bytes. This is a removal of 69.6% of the redundancy for SQUEEZE (or 65.8% if you add the size of the language code file), compared with a redundancy reduction of only 58.8% for PKARC. This may not seem like much of a difference, but the smaller compressed file fits on a 1.44 Megabyte 3 1/2 inch floppy disk, and the other one doesn't. This would be more of a significant achievement if SQUEEZE weren't so slow. It took well over two hours to do this running on one of the fastest PC's available (25 MHz 80386), while PKARC worked for about a minute on the task. I believe that both the speed and amount of redundancy reduction can be improved upon. I am therefore publishing the source code for what I did in Appendix B, so that someone else may be able to build upon this idea. B. Linguistic Parsing Instead of developing a Huffman code based on trigraphs or other such arbitrary chunks of fixed length, why not develop a Huffman code based on the way people read words? Letter combinations made from statistically properly distributed trigraphs will probably contain a lot of valid English words, but they will contain a lot of garbage, too. If, however, all symbols correspond to English words (or punctuation), then a collection of random symbols will look a lot more like English. Of course, the arrangement of words will probably not make sense in either group, but an intuitive analysis indicates that a code based on linguistic parsing of words will contain a lot less redundancy than will a fixed-length code. To define linguistic symbols, I called any grouping of alphabetic symbols unbroken by non-alphabetic symbols a word symbol. I called any group of contiguous spaces a space symbol. I called a carriage return + line feed combination a new line symbol. Any other punctuation and numbers were called symbols themselves. There may be more efficient ways to split up text into symbols, since, for example, words are almost always followed by a space. I also counted upper and lower case words different symbols. It may have been more efficient to use a modifier symbol to indicate that the following word is capitalized or written in all capitol letters. Of course, if all of the input text is all upper case, as would be the case for some telegraphic type communications systems, this is not a concern. Once a linguistic symbol is detected in the input text, it is encoded with its corresponding Huffman code representation. This is then decoded at the other end of the communications link or after retrieving the data from storage, using the same Huffman code. So that the Huffman code generation does not have to be done each time a file is compressed, an escape code that meant ``the following symbol of ____ bits is not in the code, and is to be interpreted directly'' is used. For these symbols, a simple suppression of the most significant bit is used, resulting in a savings of one out of eight bits. C. Huffman Coding Given a set of symbols and their associated probabilities of occurrence, a Huffman code for those symbols is constructed by generating a tree. Start with a set of symbols as individual nodes. While there are still separate nodes, combine the two nodes with the lowest probability (resolve ties arbitrarily) into a new node with a probability equal to the sum of the probabilities of the two nodes that were combined. Attach one of the nodes to the new node via a 1 and the other node via a 0. After all of the nodes are combined, the code for each symbol is the sequence of ones and zeroes assigned to the branches of the tree starting at the root and ending at the symbol. [STO] Note that this algorithm results in a very similar code to a Shannon-Fanno code (which generates the tree by starting at the root and building out to the leaves), but it is computationally more efficient to implement. D. Pascal Programs The pascal programs that implement this linguistic file compression are in appendix B. The first one, COUNT, merely counts the frequency of occurrence of words in sample text. It is not necessary that the text analyzed be the same text to be compressed, just that it be similar. For example, better compression would be expected when using a code based on personnel records when using that code to compress personnel records than when compressing medical discussions. However, the significant gains made in compressing the more common words in the code offset the lesser compression (or even possible expansion) encountered when words not in the code are encountered. It is also desirable to COUNT a large sample of text to ensure accurate word frequencies. The second program, MAKETREE takes the frequency data from the output file generated by COUNT and generates a Huffman code from it. It then writes this tree out into a file that is used by SQUEEZE. To get reasonable speed from this program, it uses as much memory as MS-DOS will let it for the more frequently accessed information, and uses a temporary file for the rest of the information. The program runs significantly faster if this temporary file is on a RAM disk in extended or expanded memory (beyond the basic 640 KBytes of the MS-DOS domain). SQUEEZE, the third program, does the squeezing and unsqueezing of files based on the output file of MAKETREE. The same Huffman coding tree can be used for many different input text files, because the statistics of English (especially when the same types of text, such as all business letters or all technical reports) remain fairly constant even when the content of the text conveys different meanings. SQUEEZE handles words not in its dictionary in stride, but highlights them as they scroll across the screen in processing so that the user can get a feel for how well the text is matched to the code tree in use. If only a few proper names and such are highlighted, then the code in use is a good one to use. If not, then perhaps a regeneration of the code would be in order. Note that the code file used to SQUEEZE a file must be identical to the one used to unSQUEEZE it, so it must either be stored or sent with the file, or sent in advance and agreed upon by communicating parties. After testing these programs on various ASCII text files, it can be seen that significant compression can be obtained, and that the compressed files can be recovered reliably. There is, however, room for improvement in the exact way that text is parsed and in the speed of the implementation. Nevertheless, when maximum security is desired, I recommend first compressing an ASCII text file with SQUEEZE, then encrypting it with CRYPTMPJ. The code file (LANGUAGE.COD) used by SQUEEZE and the encryption key should be sent by a separate, secure channel to the receiving party (or stored in a physically secure location for data storage) in advance of the message. The combination of minimal natural redundancy in the message sent and the use of a good encryption algorithm should be difficult to crack. IX. CONCLUSION The MPJ Encryption algorithm provides an alternative to DES that is more secure, providing that the software and/or hardware does not get corrupted or tampered with, and providing that the keys are managed effectively. I have implemented and tested the algorithm in software for the IBM PC and compatible machines. The algorithm may be implemented directly in hardware for faster data rates. The MPJ Encryption Algorithm may be used by itself, or it may be improved upon by compressing the data before encryption. The use of data compression in conjunction with any encryption algorithm drastically increases the security of the encrypted data. For natural language text, one approach that yields improved compression over the compression of constant size blocks of data is the compression of linguistic units, such as words. I have achieved reversible compression of such files to less than 35% of their original size using linguistic parsing and a Huffman code. For binary data, such as computer programs, the use of existing techniques, such as those in PKARC, written by Phil Katz and available on most computer bulletin boards, are recommended. It is hoped that the MPJ encryption algorithm, as well as some of my ideas on data compression, will make a positive contribution towards data security, communications privacy, and efficiency of data storage and transmission. May God prosper those who use this knowledge and build upon it for good. REFERENCES [ALB] Douglas J. Albert and Stephen P. Morse,``Combating Software Piracy by Encryption and Key Management,''Computer, Vol. 21, No. 5, April 1984, pp. 68-73 Los Alamitos, CA: IEEE Computer Society. [AME] Stanley R. Ames, Jr., Morrie Gasser, and Roger R. Schell, ``Security Kernel Design and Implementation: An Introduction,'' Computer, Vol. 16, No. 7, July 1983, pp. 14-22 Los Alamitos, CA: IEEE Computer Society. [BAL] W. W. Rouse Ball & H. S. M. Coxeter, Mathematical Recreations & Essays. New York: The MacMillan Company, 1962. [BAU] Friedrich L. Bauer, ``Cryptology - Methods and Maxims,'' in Proceedings of the Workshop on Cryptography at Burg Feuerstein, Germany, March 29-April 2, 1982, Berlin, Heidelberg, New York: Springer-Verlag, 1983. [BEK] H. J. Beker, ``A Survey of Encryption Algorithms,'' in International Conference on Secure Communication Systems, 22-23 February 1984, London, New York: Institute of Electrical Engineers, 1984. TK5102.5.I519.1984 [BET] Thomas Beth, Introduction to Proceedings of the Workshop on Cryptography at Burg Feuerstein, Germany, March 29 - April 2, 1982, Berlin, Heidelberg, New York: Springer-Verlag, 1983. [BON] D. J. Bond, ``Practical Primality Testing,'' in International Conference on Secure Communication Systems, 22-23 February 1984, London, New York: Institute of Electrical Engineers, 1984. TK5102.5.I519.1984 [BRO] C. B. Brookson and S. C. Serpell, ``Security on the British Telecom Satstream Service'' in International Conference on Secure Communication Systems, 22-23 February 1984, London, New York: Institute of Electrical Engineers, 1984. TK5102.5.I519.1984 [BUR] Dave Bursky, ``Protect Your EEPROM Data and Gain More Flexibility,'' Electronic Design, 26 May 1988, pp.43-48. Waseca, MN: Brown Printing Co. [CHA] Leslie S. Chalmers, ``An Analysis of the Differences Between the Computer Security Practices in the Military and Private Sectors,'' in Proceedings of the IEEE 1986 Symposium on Security and Privacy, Oakland, CA, April 7-9, 1986. QA76.9.A25.595.1986 [CHI] H. R. Chivers, ``A Practical Fast Exponentiation Algorithm for Public Key,'' in International Conference on Secure Communication Systems, 22-23 February 1984, London, New York: Institute of Electrical Engineers, 1984. TK5102.5.I519.1984 [COO] R. H. Cooper, ``Linear Transformations in Galois Fields and their Application to Cryptography,'' Cryptologia Magazine, Volume 4, Number 3, July 1980, pages 184-188. Republished electronically by Tony Patti. [COS] Terry Costlow, ``Global Computer Network Cracks Cryptographic Mathematics Barrier,'' Electronic Engineering Times, Issue 508, 17 October 1988, p. 14, Manhasset, NY: CMP Publications, Inc. [CRC] Handbook of Chemistry and Physics, 53rd Edition, Cleveland, OH: The Chemical Rubber Company, 1972. [DEA] Cipher A. Deavers and Louis Kruh, Machine Cryptography and Modern Cryptanalysis, Norwood, MA: Artech House, Inc., 1985. Z103.D43.1985 [DEN] Dorothy Elisabeth Robling Denning, Cryptography and Data Security, Reading, MA; Menlo Park, CA; London; Amsterdam; Don Mills, Ontario; Sydney: Addison-Wesley Publishing Company, 1982. QA76.9.A25.D46.1982 [DES] National Bureau of Standards, Federal Information Processing Standards Publication Number 46 Dated 15 January 1977. [DOH] Richard Doherty, ``FBI Nabs Pirated VCII,'' Electronic Engineering Times, Manhasset, N. Y.: CMP Publications, Inc., Issue 500, August 22, 1988. [EET] ``NSA OKs Decryption Unit,'' Electronic Engineering Times, August 29, 1988, Manhasset, N. Y.: CMP Publications, Inc. [FRA] Ole Immanuel Franksen, Mr. Babbage's Secret - The Tale of a Cipher and APL, Englewood Cliffs, NJ: Prentice-Hall, 1985. Z103.B2.F72.1985 [FRI] H. J. Beker, J. M. K. Friend, and P. W. Halliden, ``Simplifying Key Management in Electronic Fund Transfer Point of Sale Systems,'' in International Conference on Secure Communication Systems, 22-23 February 1984, London, New York: Institute of Electrical Engineers, 1984. TK5102.5.I519.1984 [FRM] Lester J. Fraim, ``Scomp: A Solution to the Multilevel Security Problem,'' Computer, Vol. 16, No. 7, July 1983,pp. 26-34, Los Alamitos, CA: IEEE Computer Society. [GOO] G. Goos and J. Hartmans, Lecture Notes in Computer Science. Z102.5.W67.1982 [GOR] J. A. Gordon and H. Retkin, ``Are Big S-Boxes Best?'' in Proceedings of the Workshop on Cryptography at Burg Feuerstein, Germany, March 29-April 2, 1982, Berlin, Heidelberg, New York: Springer-Verlag, 1983. [HAE] D. G. Haenshke, D. A. Kettler, and E. Oberer, ``Network Management and Congestion in the U. S. Telecommunications Network,'' IEEE Transactions on Communications, volume COM-29,pp. 376-385, April 1981. [HEL] Gilbert Held and Thomas R. Marshall, Data Compression, Second Edition; Chichester, New York, Brisbane,Toronto, Singapore: John Wiley & Sons, 1987. QA76.9.D33.H44.1987 [HIL] Lester S. Hill, ``Cryptography in an Algebraic Alphabet,'' American Mathematical Monthly, June 1929, pages 306-312, the American Mathematical Society. Republished electronically by Tony Patti by permission. [HLL] Lester S. Hill, ``Concerning Certain Linear Transformation Apparatus of Cryptography,'' American Mathematical Society, March 1931, pages 135-154. Republished electronically by Tony Patti by permission. [JAC] A. M. Jackson, N. A. McEvoy, and B. B.Newman, ``Project Universe Encryption Experiment'' in International Conference on Secure Communication Systems, 22-23 February 1984, London, New York: Institute of Electrical Engineers, 1984. TK5102.5.I519.1984 [KAH] David Kahn, The Codebreakers - The Story of Secret Writing, New York: The MacMillan Company, 1987. Z103.K28 [KRU] Cipher A. Deavers, David Kahn, Louis Kruh, Greg Mellen, and Brian Winkel, Cryptology Yesterday, Today and Tomorrow, Norwood, MA: Artech House, Inc., 1987. Z103.C76.1987 [LAN] Carl E. Landwehr, The Best Available Technologies for Computer Security,'' Computer, Vol. 16, No. 7, July 1983,pp. 86-100, Los Alamitos, CA: IEEE Computer Society. [KAT] Phil Katz, PKARC FAST! Archive Create/Update Utility Version 3.5 04-27-87, published electronically in PKX35A35.EXE. [KON] Alan G. Konheim, Cryptography: A Primer, New York: John Wiley & Sons. Z103.K66 [KOZ] Wladyslaw Kozaczuk, Enigma - How the German Machine Cipher was Broken and How it was Read by the Allies in World War Two, University Publications of America, Inc., 1984.D810.C88.K6813.1984 [LIN] Christer Lind<130>n, ``Electronic Seal for Protection of Electronic Money in Sweden, Finland, and Norway'' in Proceedings of 1986 International Carnahan Conference on Security Technology: Electronic Crime Countermeasures, August 12-14, 1986. QA76.9.A25.C41.1986 [LOD] N. Lodge, B. Flannaghan, and R. Morcom,``Vision Scrambling of C-MAC DBS Signals,'' in International Conference on Secure Communication Systems, 22-23 February 1984, London, New York: Institute of Electrical Engineers, 1984. TK5102.5.I519.1984 [LU] W. P. Lu and M. K. Sandareshan, ``A Hierarchical Key Management Scheme for End-to-End Encryption in Internet Environments'' in Proceedings of the IEEE 1986 Symposium on Security and Privacy, Oakland, CA, April 7-9, 1986. QA76.9.A25.595.1986 [MCL] Vin McLellan, ``Drugs and DES: A New Connection,'' Digital Review, August 24, 1987. [MER] Ralph C. Merkle, Secrecy, Authentication, and Public Key Systems, Ann Arbor, MI: UMI Research Press,1982. QA76.9.A25M47.1982 [MEY] Carl H. Meyer & Stephen M. Matyas,Cryptography: a New Dimension in Computer Data Security - A Guide for the Design and Implementation of Secure Systems, New York, Chichester, Brisbane, Toronto, Singapore: John Wiley & Sons, 1982. Z103.M55.1982 [MIT] C. J. Mitchell, ``A Comparison of the Cryptographic Requirements for Digital Secure Speech Systems Operating at Different Bit Rates,'' in International Conference on Secure Communication Systems, 22-23 February 1984, London, New York: Institute of Electrical Engineers, 1984. TK5102.5.I519.1984 [PAT] Tony Patti, A Galois Field Cryptosystem, 1986. Published electronically by Tony Patti, editor of Cryptosystems Journal, 9755 Oatley Lane, Burke, VA 22015. [PIP] Fred Piper, ``Stream Ciphers,'' in Proceedings of the Workshop on Cryptography at Burg Feuerstein, Germany, March 29-April 2, 1982, Berlin, Heidelberg, New York: Springer-Verlag, 1983. [PER] Tekla S. Perry and Paul Wallich, ``Can Computer Crime be Stopped?,'' IEEE Spectrum, May, 1984, pp.34-49. New York, NY: The Institute of Electrical and Electronic Engineers, Inc. [RET] H. Retkin, ``Multi-Level Knapsack Encryption,'' in International Conference on Secure Communication Systems, 22-23 February 1984, London, New York: Institute of Electrical Engineers, 1984. TK5102.5.I519.1984 [SAT] J. Sattler and C. P. Schnorr, ``Ein Effizienzvergleich Der Faktorisierungsverfahren von Morrison-Brillhart und Schroeppel,'' in Proceedings of the Workshop on Cryptography at Burg Feuerstein, Germany, March 29-April 2, 1982, Berlin, Heidelberg, New York: Springer-Verlag, 1983. [SCH] C. P. Schnorr, ``Is the RSA Scheme Safe?'' in Proceedings of the Workshop on Cryptography at Burg Feuerstein, Germany, March 29-April 2, 1982, Berlin, Heidelberg, New York: Springer-Verlag, 1983. [SHA] C. E. Shannon, ``Communication Theory of Secrecy Systems,'' Bell System Technical Journal, Volume 28, 1949. [STO] James A. Storer, Data Compression Methods and Theory. Rockville, MD: Computer Science Press, 1988. QA76.9.D33S76 [TIL] Henk C. A. van Tilborg, An Introduction to Cryptology, Boston, Dordrecht, Lancaster: Kluwer Academic Publishers, 1988. Z103.T54.1987 [RET] H. Retkin, ``Multi-Level Knapsack Encryption,'' in International Conference on Secure Communication Systems, 22-23 February 1984, London, New York: Institute of Electrical Engineers, 1984. TK5102.5.I519.1984 [ULT] Ultron Labs Corporation, ``Crypto Module Processes TOP SECRET Data,'' EE Product News, October 1988, p. 16, Overland Park, KS: Intertec Publishing. [VAN] J. Vandewalle, R. Govaerts, W. De Becker, M. Decroos, & G. Speybrouk, ``RSA-Based Implementation of Public Key Cryptographic Protection in Office Systems'' in Proceedings of 1986 International Carnahan Conference on Security Technology: Electronic Crime Countermeasures, August 12-14, 1986. QA76.9.A25.C41.1986 APPENDIX A. MPJ in Pascal {$R-} {Range checking off} {$S-} {Stack checking off} {$I+} {I/O checking on} {$N-} {Don't use 80x87 numeric coprocessor} program cryptmpj; { Encryption designed to exceed DES in security value and be implementable on an IBM PC compatible machine. } Uses Crt, Dos; { Include screen handling & MS DOS interface functions. } const maxinbuffer = 16; ap = #39; { Apostrophe for write statements. } type blocks = array[0..15] of byte; { 16 byte (128 bit) blocks. } stype = array[0..9,0..15,0..255] of byte; { Holds substitution boxes. } ptrstype = ^stype; { Dynamic variable required to get beyond 64K limit. } string80 = string[80]; { Long character strings. } string15 = string[15]; { Short character strings. } var initial, { Initialiazation vector. } feedback, { Chaining feedback array. } key, { Encryption/decryption key. } buffer, outbuf: blocks; { File read/write buffers. } s, { Substitution boxes. } si: ptrstype; { Inverse substitution boxes. } bytesdone, { Number of bytes done. } number: longint; { Used to read in key & initialization vector. } que, { Position in file specification que. } i, j, k, { Iteration & array indexes. } actualread: integer; { Actual number of bytes read in from file. } inputfile: file; { File to encrypt or decrypt in place. } keyfile: text; { External key file. } intkeyfile: file of stype; {Holds s and si. } keyname, { Name of external key file. } intkeyname, { Name of internal key file. } path, { Used in parsing file specification. } temp: string[255]; { Used in determining encrypt/decrypt option. } inputfilename, { MS-DOS name of inputfile. } filespec: string80; { File specification that may include wild cards} fileque: array[0..15] of string80; { Names of multiple file specifications} name: string15; { Name of next file to process. } decryption, { True iff decryption is desired. } filefound: boolean; { At least one file was found to process.} ch: char; { Character input. } search: searchrec; { Used in finding matches to wild cards. } procedure permute(x: blocks; var y: blocks); { This procedure is designed to make each bit of the output dependent on as many bytes of the input as possible, especially after repeated application. Each output byte takes its least significant bit from the corresponding input byte. The next higher bit comes from the corresponding bit of the input byte to the left. This is done until all bits of the output byte are filled. Where there is no byte to the left, the byte at the far right is used. } begin y[0] := (x[0] and 1) or (x[1] and 2) or (x[2] and 4) or (x[3] and 8) or (x[4] and 16) or (x[5] and 32) or (x[6] and 64) or (x[7] and 128); y[1] := (x[1] and 1) or (x[2] and 2) or (x[3] and 4) or (x[4] and 8) or (x[5] and 16) or (x[6] and 32) or (x[7] and 64) or (x[8] and 128); y[2] := (x[2] and 1) or (x[3] and 2) or (x[4] and 4) or (x[5] and 8) or (x[6] and 16) or (x[7] and 32) or (x[8] and 64) or (x[9] and 128); y[3] := (x[3] and 1) or (x[4] and 2) or (x[5] and 4) or (x[6] and 8) or (x[7] and 16) or (x[8] and 32) or (x[9] and 64) or (x[10] and 128); y[4] := (x[4] and 1) or (x[5] and 2) or (x[6] and 4) or (x[7] and 8) or (x[8] and 16) or (x[9] and 32) or (x[10] and 64) or (x[11] and 128); y[5] := (x[5] and 1) or (x[6] and 2) or (x[7] and 4) or (x[8] and 8) or (x[9] and 16) or (x[10] and 32) or (x[11] and 64) or (x[12] and 128); y[6] := (x[6] and 1) or (x[7] and 2) or (x[8] and 4) or (x[9] and 8) or (x[10] and 16) or (x[11] and 32) or (x[12] and 64) or (x[13] and 128); y[7] := (x[7] and 1) or (x[8] and 2) or (x[9] and 4) or (x[10] and 8) or (x[11] and 16) or (x[12] and 32) or (x[13] and 64) or (x[14] and 128); y[8] := (x[8] and 1) or (x[9] and 2) or (x[10] and 4) or (x[11] and 8) or (x[12] and 16) or (x[13] and 32) or (x[14] and 64) or (x[15] and 128); y[9] := (x[9] and 1) or (x[10] and 2) or (x[11] and 4) or (x[12] and 8) or (x[13] and 16) or (x[14] and 32) or (x[15] and 64) or (x[0] and 128); y[10] := (x[10] and 1) or (x[11] and 2) or (x[12] and 4) or (x[13] and 8) or (x[14] and 16) or (x[15] and 32) or (x[0] and 64) or (x[1] and 128); y[11] := (x[11] and 1) or (x[12] and 2) or (x[13] and 4) or (x[14] and 8) or (x[15] and 16) or (x[0] and 32) or (x[1] and 64) or (x[2] and 128); y[12] := (x[12] and 1) or (x[13] and 2) or (x[14] and 4) or (x[15] and 8) or (x[0] and 16) or (x[1] and 32) or (x[2] and 64) or (x[3] and 128); y[13] := (x[13] and 1) or (x[14] and 2) or (x[15] and 4) or (x[0] and 8) or (x[1] and 16) or (x[2] and 32) or (x[3] and 64) or (x[4] and 128); y[14] := (x[14] and 1) or (x[15] and 2) or (x[0] and 4) or (x[1] and 8) or (x[2] and 16) or (x[3] and 32) or (x[4] and 64) or (x[5] and 128); y[15] := (x[15] and 1) or (x[0] and 2) or (x[1] and 4) or (x[2] and 8) or (x[3] and 16) or (x[4] and 32) or (x[5] and 64) or (x[6] and 128); end; procedure ipermute(x: blocks; var y: blocks); { This is the inverse of the procedure permute. } begin y[0] := (x[0] and 1) or (x[15] and 2) or (x[14] and 4) or (x[13] and 8) or (x[12] and 16) or (x[11] and 32) or (x[10] and 64) or (x[9] and 128); y[1] := (x[1] and 1) or (x[0] and 2) or (x[15] and 4) or (x[14] and 8) or (x[13] and 16) or (x[12] and 32) or (x[11] and 64) or (x[10] and 128); y[2] := (x[2] and 1) or (x[1] and 2) or (x[0] and 4) or (x[15] and 8) or (x[14] and 16) or (x[13] and 32) or (x[12] and 64) or (x[11] and 128); y[3] := (x[3] and 1) or (x[2] and 2) or (x[1] and 4) or (x[0] and 8) or (x[15] and 16) or (x[14] and 32) or (x[13] and 64) or (x[12] and 128); y[4] := (x[4] and 1) or (x[3] and 2) or (x[2] and 4) or (x[1] and 8) or (x[0] and 16) or (x[15] and 32) or (x[14] and 64) or (x[13] and 128); y[5] := (x[5] and 1) or (x[4] and 2) or (x[3] and 4) or (x[2] and 8) or (x[1] and 16) or (x[0] and 32) or (x[15] and 64) or (x[14] and 128); y[6] := (x[6] and 1) or (x[5] and 2) or (x[4] and 4) or (x[3] and 8) or (x[2] and 16) or (x[1] and 32) or (x[0] and 64) or (x[15] and 128); y[7] := (x[7] and 1) or (x[6] and 2) or (x[5] and 4) or (x[4] and 8) or (x[3] and 16) or (x[2] and 32) or (x[1] and 64) or (x[0] and 128); y[8] := (x[8] and 1) or (x[7] and 2) or (x[6] and 4) or (x[5] and 8) or (x[4] and 16) or (x[3] and 32) or (x[2] and 64) or (x[1] and 128); y[9] := (x[9] and 1) or (x[8] and 2) or (x[7] and 4) or (x[6] and 8) or (x[5] and 16) or (x[4] and 32) or (x[3] and 64) or (x[2] and 128); y[10] := (x[10] and 1) or (x[9] and 2) or (x[8] and 4) or (x[7] and 8) or (x[6] and 16) or (x[5] and 32) or (x[4] and 64) or (x[3] and 128); y[11] := (x[11] and 1) or (x[10] and 2) or (x[9] and 4) or (x[8] and 8) or (x[7] and 16) or (x[6] and 32) or (x[5] and 64) or (x[4] and 128); y[12] := (x[12] and 1) or (x[11] and 2) or (x[10] and 4) or (x[9] and 8) or (x[8] and 16) or (x[7] and 32) or (x[6] and 64) or (x[5] and 128); y[13] := (x[13] and 1) or (x[12] and 2) or (x[11] and 4) or (x[10] and 8) or (x[9] and 16) or (x[8] and 32) or (x[7] and 64) or (x[6] and 128); y[14] := (x[14] and 1) or (x[13] and 2) or (x[12] and 4) or (x[11] and 8) or (x[10] and 16) or (x[9] and 32) or (x[8] and 64) or (x[7] and 128); y[15] := (x[15] and 1) or (x[14] and 2) or (x[13] and 4) or (x[12] and 8) or (x[11] and 16) or (x[10] and 32) or (x[9] and 64) or (x[8] and 128); end; procedure makesbox(key: blocks); { This procedure generates internal keys by filling the substitution box array s^ based on the external key given as input. } var i, j, k: integer; procedure makeonebox(i, j: integer; key: blocks); var pos, m, n, p, startbit, bitmask, startbyte, keybyte: word; empty: array[0..255] of boolean; begin for m := 0 to 255 do { The empty array is used to make sure that } empty[m] := true; { each byte of the array is filled only once. } startbit := 1; startbyte := 0; for n := 255 downto 128 do { n counts the number of bytes left to fill } begin keybyte := startbyte; bitmask := startbit; m := 0; for p := 0 to 7 do { m is obtained by bit selection on the key } begin m := m or (key[keybyte] and bitmask); bitmask := bitmask shl 1; if bitmask > 128 then begin bitmask := 1; inc(keybyte); if keybyte > 15 then keybyte := 0; end; end; pos := (n * m) div 255; { pos is the position among the UNFILLED } m := 0; { components of the s^ array that the } p := 0; { number n should be placed. } while m < pos do begin inc(p); if empty[p] then inc(m); end; while not empty[p] do inc(p); s^[i, j, p] := n; empty[p] := false; startbit := startbit shl 1; { The starting position of the bit } if startbit > 128 then { selection for the key is rotated } begin { left one bit for the next n. } startbit := 1; inc(startbyte); if startbyte > 15 then startbyte := 0 end; end; startbyte := 0; startbit := 1; for n := 127 downto 1 do { This half of the algorithm is the } begin { same as the upper half, except that } keybyte := startbyte; { only 7 bits are selected for m. } bitmask := startbit; m := 0; for p := 0 to 6 do begin m := m or (key[keybyte] and bitmask); bitmask := bitmask shl 1; if bitmask > 64 then begin bitmask := 1; keybyte := keybyte + 3; if keybyte > 15 then keybyte := keybyte - 16; end; end; pos := (n * m) div 127; m := 0; p := 0; while m < pos do begin inc(p); if empty[p] then inc(m); end; while not empty[p] do inc(p); s^[i, j, p] := n; empty[p] := false; startbit := startbit shl 1; if startbit > 64 then begin startbit := 1; inc(startbyte); if startbyte > 15 then startbyte := 0 end; end; p := 0; while not empty[p] do inc(p); s^[i,j,p] := 0; end; begin new(s); for i := 0 to 9 do for j := 0 to 15 do begin write(#13,'Filling substitution boxes for round ',i+1,' of 10, byte ', j+1,' of 16. '); makeonebox(i, j, key); permute(key, key); { Shuffle key bit positions. } for k := 0 to 15 do { Run key through last s-box before } key[k] := s^[i, j, key[k]]; { making the next s-box. } end; writeln; end; procedure makesi; { This procedure fills the inverse substitution box array si^. It is not necessary to call this procedure unless the decryption mode is used. } var i, j, k: integer; begin new(si); for i := 0 to 9 do for j := 0 to 15 do for k := 0 to 255 do si^[i, j, s^[i, j, k]] := k; end; {$I CRYPTINC.PAS} { Include additional file handling & user interface. } procedure substitute(round: integer; x: blocks; var y: blocks); var i: integer; begin for i := 0 to 15 do y[i] := s^[round,i,x[i]]; end; procedure isubst(round: integer; x: blocks; var y: blocks); var i: integer; begin for i := 0 to 15 do y[i] := si^[round,i,x[i]]; end; procedure encrypt(x: blocks; var y: blocks); { Encrypt a block of 16 bytes. } var z: blocks; begin substitute(0, x, y); permute(y, z); substitute(1, z, y); permute(y, z); substitute(2, z, y); permute(y, z); substitute(3, z, y); permute(y, z); substitute(4, z, y); permute(y, z); substitute(5, z, y); permute(y, z); substitute(6, z, y); permute(y, z); substitute(7, z, y); permute(y, z); substitute(8, z, y); permute(y, z); substitute(9, z, y); end; procedure decrypt(x: blocks; var y: blocks); { Decrypt a block of 16 bytes. } var z: blocks; begin isubst(9, x, y); ipermute(y, z); isubst(8, z, y); ipermute(y, z); isubst(7, z, y); ipermute(y, z); isubst(6, z, y); ipermute(y, z); isubst(5, z, y); ipermute(y, z); isubst(4, z, y); ipermute(y, z); isubst(3, z, y); ipermute(y, z); isubst(2, z, y); ipermute(y, z); isubst(1, z, y); ipermute(y, z); isubst(0, z, y); end; begin startup; { Initialize files, call makesbox (and makesi if required). } que := 0; while (que < 16) and (fileque[que] <> '') do begin filespec := fileque[que]; inputfilename := filespec; path := ''; repeat i := pos('\',inputfilename); if i > 0 then begin path := path + copy(inputfilename, 1, i); inputfilename := copy(inputfilename, i+1, length(inputfilename)); end; until i = 0; findfirst(filespec, $22, search); while doserror = 0 do begin inputfilename := path + search.name; assign(inputfile, inputfilename); reset(inputfile,1); filefound := true; bytesdone := 0; for i := 0 to 15 do feedback[i] := initial[i]; if decryption then begin writeln('Decrypting ',inputfilename); while not eof(inputfile) do begin blockread(inputfile,buffer,maxinbuffer,actualread); encrypt(feedback,outbuf); { Note that decrypt is not ever called when using block chaining with ciphertext feedback. It would be called if the electronic codebook mode were used, instead. } for i := 0 to 15 do begin outbuf[i] := outbuf[i] xor buffer[i]; feedback[i] := buffer[i] end; seek(inputfile, filepos(inputfile) - actualread); blockwrite(inputfile, outbuf, actualread); bytesdone := bytesdone + actualread; write(#13, bytesdone:9,' bytes decrypted. '); end; writeln(#13, bytesdone:9,' bytes decrypted. '); end else begin writeln('Encrypting ',inputfilename); while not eof(inputfile) do begin blockread(inputfile,buffer,maxinbuffer,actualread); encrypt(feedback,outbuf); for i := 0 to 15 do begin outbuf[i] := outbuf[i] xor buffer[i]; feedback[i] := outbuf[i] end; seek(inputfile, filepos(inputfile) - actualread); blockwrite(inputfile, outbuf, actualread); bytesdone := bytesdone + actualread; write(#13, bytesdone:9,' bytes encrypted. '); end; writeln(#13, bytesdone:9,' bytes encrypted. '); end; close(inputfile); findnext(search); end; inc(que); end; if not filefound then writeln('No matching files found.'); writeln('Done.'); end. APPENDIX B. Linguistic Data Compression Programs program count; {Program to count the probability of occurances of words in text.} uses dos, crt; const maxbuf = 4096; maxstr = 15; ap = #39; type wd = string[maxstr]; ptr = ^entry; yarn = string[80]; entry = record wrd: wd; {Linguistic word} count: longint; {How many times this word was found} next: ptr; {Next record in sequence} end; var outfile, {File with output report} infile: text; {File with input text to be analyzed} bottom, {Last entry record} newbuf, {New entry record} prev: ptr; {Previous entry record} buffer, {Current entry record} top: array[1..255] of ptr; {First entry record} words: wd; {Word under construction} temp, path, {Path of input file} filemask, {File name mask} filespec, {Path + name (including wild cards)} inname, {Name of input file} outname: yarn; {Name of output file} ch: char; i, j, cmdparameter, topindex: integer; cnt: longint; filefound: boolean; inbuf, outbuf: array[1..maxbuf] of byte; s: searchrec; {Record fill: array[1..21] of byte;attr:byte;time, size:longint;name: string[12]} function exist(filename: wd): boolean; {Returns TRUE iff the file exists.} var f: file; begin {exist} assign(f,filename); {$I-} reset(f); {$I+} if IOresult = 0 then begin exist := true; close(f) end else exist := false; end; {exist} function punct(ch: char): boolean; {TRUE iff ch is not a letter or chr(255).} var c: integer; begin c := ord(ch); if ((c > 10) and (c <= 64)) or ((c >= 91) and (c <= 96)) or ((c >= 123) and (c <= 254)) or ((c > 0) and (c < 10)) then punct := true else punct := false; end; function letter(ch: char): boolean; {TRUE iff ch is a letter A-Z or a-z.} var c: integer; begin c := ord(ch); if ((c >= 65) and (c <= 90)) or ((c >= 97) and (c <= 122)) then letter := true else letter := false; end; procedure parse(var words: wd; var ch: char); {Extracts a word or item of punctuation from input file.} begin {parse} words := ''; while not (letter(ch) or punct(ch) or eof(infile)) do begin read(infile,ch); end; if punct(ch) then begin if ord(ch) = 13 then begin words := chr(255)+chr(255); if not eof(infile) then read(infile, ch); end else if ch = ' ' then begin words := ch; while (not eof(infile)) and (ch = ' ') and (length(words) < maxstr) do begin read(infile,ch); if ch = ' ' then words := words + ch; end; if (length(words) = maxstr) and (not eof(infile)) then read(infile,ch); end else begin words := ch; {return one punctuation character} if not eof(infile) then read(infile, ch); end; end else begin if letter(ch) then begin words := ch; while (not eof(infile)) and letter(ch) and (length(words) < maxstr) do begin read(infile,ch); if letter(ch) then words := words + ch; end; if (length(words) = maxstr) and (not eof(infile)) then read(infile,ch); end; end; end; {parse} procedure update(words: wd); {Increments the count associated with a word.} begin topindex := ord(words[1]); buffer[topindex] := top[topindex]; {Find record that matches word, if it exists.} while (buffer[topindex]^.wrd < words) and (buffer[topindex]^.next <> nil) do begin prev := buffer[topindex]; buffer[topindex] := buffer[topindex]^.next end; if buffer[topindex]^.wrd = words then {Increment count on word that already exists} inc(buffer[topindex]^.count) else begin {Add word to middle of list with count of one} new(newbuf); newbuf^.wrd := words; newbuf^.count := 1; newbuf^.next := prev^.next; prev^.next := newbuf; end; end; procedure report; {Produces a report file.} var total: longint; begin assign(outfile, paramstr(1)); rewrite(outfile); total := 0; for topindex := 1 to 255 do begin buffer[topindex] := top[topindex]; while (buffer[topindex]^.next <> nil) do begin buffer[topindex] := buffer[topindex]^.next; if buffer[topindex]^.count > 0 then begin writeln(outfile,buffer[topindex]^.wrd); writeln(outfile,buffer[topindex]^.count); total := total + buffer[topindex]^.count; end; end; end; writeln(outfile, chr(255)); writeln(outfile, '1'); writeln(outfile, 'Total count: ',total); close(outfile); end; {report} begin clrscr; writeln(' This program counts words in a set This program may be copied freely '); writeln(' of input files to determine their for educational use or to try it '); writeln(' frequency of occurance. This data out. If you like it, donations '); writeln(' is then stored in a file for use by will be accepted by the author. '); writeln(' MAKETREE, which constructs a code This program is believed to be '); writeln(' tree with the Huffman algorithm. reliable, but it is the user',ap,'s '); writeln(' The code tree is used by SQUEEZE. responsibility to determine its '); writeln(' fitness for use. No liability '); writeln(' Mike Johnson will be assumed by the author. '); writeln(' P. O. Box 1151 Copyright (C) 1988 Mike Johnson. '); writeln(' Longmont, CO 80502-1151 All rights reserved. '); if paramcount < 2 then begin writeln; writeln('Syntax: COUNT outputfilename inputfilename[s]'); writeln(' input file names may include * and ?'); writeln(' Input files are assumed to be ASCII text.'); writeln; writeln('If a previous COUNT ouput file named LANGUAGE.OUT exists, then new counts'); writeln('will be added to the counts in LANGUAGE.OUT.'); writeln('Maximum effectiveness will be obtained if the nature of the files COUNTed'); writeln('is similar to the nature of the files to be compressed with SQUEEZE.'); end else begin filefound := false; new(bottom); bottom^.wrd := chr(255); bottom^.count := 0; bottom^.next := nil; for topindex := 1 to 255 do begin new(top[topindex]); top[topindex]^.wrd := chr(1); top[topindex]^.count := 0; top[topindex]^.next := bottom; buffer[topindex] := top[topindex]; end; ch := chr(0); words := ' '; if exist('LANGUAGE.OUT') then begin assign(infile,'LANGUAGE.OUT'); reset(infile); writeln('Reading in existing LANGUAGE.OUT.'); while (not eof(infile)) and (words <> chr(255)) do begin readln(infile,words); if words <> chr(255) then begin topindex := ord(words[1]); readln(infile,cnt); new(newbuf); newbuf^.wrd := words; newbuf^.count := cnt; newbuf^.next := bottom; buffer[topindex]^.next := newbuf; buffer[topindex] := newbuf; write(#13,words,' '); end; end; close(infile); writeln; end; {Initialize input files.} for cmdparameter := 2 to paramcount do begin filespec := paramstr(cmdparameter); filemask := filespec; path := ''; repeat i := pos('\',filemask); if i > 0 then begin path := path + copy(filemask, 1, i); filemask := copy(filemask, i+1, length(filemask)); end; until i = 0; findfirst(filespec, readonly + hidden + archive, s); while doserror = 0 do begin filefound := true; inname := path + s.name; assign(infile, inname); reset(infile); words := ' '; writeln('>>>>>>>>>>>>>>>>>>>>> Processing ',inname); while (not eof(infile)) and (words <> '') and (not keypressed) do begin parse(words,ch); {Find word or punctuation} if words = chr(255)+chr(255) then writeln else write(words); if length(words) > 0 then update(words); {Increment word counter} end; close(infile); findnext(s); end; end; if not filefound then writeln('No matching files found.'); report; end; end. {$R+} {$M 3072,4096,655360} program maketree; uses crt; const maxstr = 15; ap = #39; type wd = string[maxstr]; yarn = string[80]; mem = ^mementry; entry = record wrd: wd; {Which linguistic word in input file} memory: mem; {Corresponding mementry.} end; mementry = record count: longint; {Word count.} onezero: byte; {Is this record pointed to by 1 or 0?} root, {Position of record nearer root.} next: mem; {Position of next record.} end; var temp: file of entry; {Temporary file} infile: text; {File with input text to be analyzed} buffer, {Current entry record} newbuf: entry; {New entry record} mtop, mbottom, mbuffer, mnewbuf, mprev: mem; inname, {Name of input file} outname, {Name of output file} words: wd; {Word under construction} ch: char; cnt1, cnt2, cnt: longint; i, j: integer; tempname: yarn; function exist(filename: wd): boolean; {Returns TRUE iff the file exists.} var f: file; begin {exist} assign(f,filename); {$I-} reset(f); {$I+} if IOresult = 0 then begin exist := true; close(f) end else exist := false; end; {exist} procedure fanno; begin cnt1 := 0; repeat {Find smallest two records.} mprev := mtop^.next; mbuffer := mprev^.next; {Combine lowest 2 probabilities of counts.} mtop^.next := mbuffer^.next; new(mnewbuf); mnewbuf^.count := mbuffer^.count + mprev^.count; mnewbuf^.onezero := 2; mnewbuf^.root := nil; mprev^.root := mnewbuf; mprev^.onezero := 1; mbuffer^.root := mnewbuf; mbuffer^.onezero := 0; {Place combined entry in proper place in chain.} mbuffer := mtop; repeat mprev := mbuffer; mbuffer := mbuffer^.next; until (mnewbuf^.count <= mbuffer^.count); mprev^.next := mnewbuf; mnewbuf^.next := mbuffer; {Check for completion.} inc(cnt1); write(#13,cnt1,' nodes combined. '); until (cnt1 >= cnt2); end; {fanno} procedure placeit; begin write(#13,cnt:11,' ',words,' '); mbuffer := mtop; repeat mprev := mbuffer; mbuffer := mbuffer^.next; until mbuffer^.count >= cnt; new(mnewbuf); newbuf.wrd := words; newbuf.memory := mnewbuf; write(temp,newbuf); mnewbuf^.count := cnt; mnewbuf^.onezero := 2; mnewbuf^.root := nil; mnewbuf^.next := mbuffer; mprev^.next := mnewbuf; end; procedure filegen; var cod: file of byte; ascnum: yarn; c: char; b, d: byte; i, shifter: integer; begin assign(cod,outname); rewrite(cod); reset(temp); repeat read(temp,buffer); mbuffer := buffer.memory; if buffer.wrd <> '' then begin for i := 0 to length(buffer.wrd) do begin b := ord(buffer.wrd[i]); write(cod,b); end; ascnum := ''; while (mbuffer^.root <> nil) do begin if mbuffer^.onezero = 0 then ascnum := ascnum + '0' else ascnum := ascnum + '1'; mbuffer := mbuffer^.root; end; d := ord(ascnum[0]); write(cod,d); shifter := 1; b := 0; for i := d downto 1 do begin if ascnum[i] = '1' then b := b + shifter; shifter := shifter shl 1; if shifter >= 256 then begin write(cod,b); b := 0; shifter := 1 end; end; if shifter > 1 then write(cod,b); write(#13,buffer.wrd:maxstr,' = ',ascnum,' '); end; until eof(temp); close(cod); close(temp); erase(temp); end; begin clrscr; writeln(' MAKETREE takes the word count This program may be copied freely '); writeln(' data created by COUNT and creates for educational use or to try it '); writeln(' a code tree using the Huffman out. If you like it, donations '); writeln(' algorithm. The resulting code tree will be accepted by the author. '); writeln(' is used by SQUEEZE when it This program is believed to be '); writeln(' compresses ASCII text files. reliable, but it is the user',ap,'s '); writeln(' responsibility to determine its '); writeln(' fitness for use. No liability '); writeln(' Mike Johnson will be assumed by the author. '); writeln(' P. O. Box 1151 Copyright (C) 1988 Mike Johnson. '); writeln(' Longmont, CO 80502-1151 All rights reserved. '); if paramcount < 2 then begin writeln('Syntax: MAKETREE infile outfile [tempfile]'); writeln(' infile is input file name (generated by COUNT)'); writeln(' outfile is output file name to be used by SQUEEZE'); writeln(' tempfile is filename to use for temporary file'); writeln('The temporary file is best put on a RAM disk in extended or expanded memory,'); writeln('but a hard disk or even a floppy will do if that is the fastest you have.'); writeln('Don',ap,'t use a RAM disk in conventional (lower 640K) memory, as this program'); writeln('already uses most of that in generating the coding tree.'); end else begin inname := paramstr(1); outname := paramstr(2); if exist(inname) then begin assign(infile,inname); reset(infile); end else begin writeln('Unable to open ',inname); exit end; if paramcount > 2 then tempname := paramstr(3) else tempname := 'ZYXWVUTS.$$$'; assign(temp,tempname); {$I- } rewrite(temp); {$I+ } if ioresult <> 0 then begin writeln('I/O error attempting to open temporary file ',tempname); halt; end; new(mtop); new(mbottom); mtop^.count := -1; mtop^.onezero := 2; mtop^.root := nil; mtop^.next := mbottom; mbottom^.count := 2147483647; mbottom^.onezero := 2; mbottom^.root := nil; mbottom^.next := nil; cnt2 := 2; writeln('Reading in and sorting word frequency data.'); repeat readln(infile,words); readln(infile,cnt); placeit; inc(cnt2); until (words = chr(255)) or eof(infile); close(infile); words := chr(255) + chr(255); placeit; words := words + chr(255); placeit; words := words + chr(255); placeit; writeln(#13,cnt2,' nodes to add to Huffman coding tree. '); fanno; writeln; writeln('Writing output file.'); filegen; writeln(#13,'Done. '); end; end. {$R+} {$M 2048,4096,655360} program squash; uses dos, crt; const maxbuf = 4096; maxstr = 15; ap = #39; type wd = string[maxstr]; yarn = string[80]; ptr = ^entry; entry = record wrd: wd; {Linguistic word} count: byte; {How many bits in code} code: longint; {Code (right justified in 4 byte integer)} next: ptr; {Pointer to next entry} end; var outfile, {File with output report} infile: file; {File with input text to be analyzed} temp, path, {Path of input file} filemask, {File name mask} filespec, {Path + name (including wild cards)} inname, {Name of input file} outname: yarn; {Name of output file} words: wd; {Word under construction} ch: char; top, bottom: array[1..32] of ptr; last, newbuf, newline, endfile, buf: ptr; mask, masked, place, long, cd, cnt: longint; infilepos, {Current position in input file buffer} inbitpos, {Current bit mask in input file buffer} inbytes, {Actual number of bytes read} outfilepos, {Current position in output file buffer} outbitpos, {Current bit mask in output file buffer} outbytes: word; {Actual number of bytes to write} cmdparameter, {Command line parameter number} start, {Starting point for command line file name scan} strlen, bite, b, i, j, k, m: integer; squish, filefound, done: boolean; outbite: byte; inbuf, outbuf: array[1..maxbuf] of byte; s: searchrec; {Record fill: array[1..21] of byte;attr:byte;time, size:longint;name: string[12]} function exist(filename: yarn): boolean; {Returns TRUE iff the file exists.} var f: file; begin {exist} assign(f,filename); {$I-} reset(f); {$I+} if IOresult = 0 then begin exist := true; close(f) end else exist := false; end; {exist} procedure getbit(var b: integer); begin if infilepos = 0 then begin if eof(infile) then done := true else begin blockread(infile, inbuf, maxbuf, inbytes); infilepos := 1; inbitpos := 1; end; end; if not done then begin b := inbuf[infilepos] and inbitpos; if b > 0 then b := 1; inbitpos := inbitpos shl 1; if inbitpos > 128 then begin inbitpos := 1; inc(infilepos); if infilepos > inbytes then infilepos := 0; end; end; end; procedure getbite(var b: integer); begin if infilepos = 0 then begin if eof(infile) then begin done := true; b := 0 end else begin blockread(infile, inbuf, maxbuf, inbytes); infilepos := 1; end; end; if not done then begin if inbitpos > 1 then begin inc(infilepos); writeln('Bit sync error in getbite!') end; b := inbuf[infilepos]; inbitpos := 1; inc(infilepos); if infilepos > inbytes then infilepos := 0; end; end; procedure putbit(b: integer); begin if (b > 1) or (b < 0) then writeln('Bad data passed to putbit.'); outbite := outbite or (outbitpos * b); outbitpos := outbitpos shl 1; if outbitpos > 128 then begin inc(outfilepos); outbuf[outfilepos] := outbite; outbite := 0; outbitpos := 1; if outfilepos >= maxbuf then begin blockwrite(outfile, outbuf, maxbuf, outbytes); if outbytes < maxbuf then begin writeln('Disk full error.'); done := true; end; outfilepos := 0; end; end; end; procedure putbite(b: integer); begin if (b > 255) or (b < 0) then writeln('Bad data passed to putbite.'); outbite := b; inc(outfilepos); outbuf[outfilepos] := outbite; outbite := 0; outbitpos := 1; if outfilepos >= maxbuf then begin blockwrite(outfile, outbuf, maxbuf, outbytes); if outbytes < maxbuf then begin writeln('Disk full error.'); done := true; end; outfilepos := 0; end; end; procedure flushbit; begin if outbitpos > 1 then begin inc(outfilepos); outbuf[outfilepos] := outbite; end; blockwrite(outfile, outbuf, outfilepos, outbytes); outbite := 0; outbitpos := 1; outfilepos := 0; end; function punct(ch: char): boolean; {TRUE iff ch is not a letter or chr(255).} var c: integer; begin c := ord(ch); if ((c > 10) and (c <= 64)) or ((c >= 91) and (c <= 96)) or ((c >= 123) and (c <= 254)) or ((c > 0) and (c < 10)) then punct := true else punct := false; end; function letter(ch: char): boolean; {TRUE iff ch is a letter A-Z or a-z.} var c: integer; begin c := ord(ch); if ((c >= 65) and (c <= 90)) or ((c >= 97) and (c <= 122)) then letter := true else letter := false; end; procedure parse(var words: wd; var ch: char); {Extracts a word or item of punctuation from input file.} var byt: integer; begin {parse} words := ''; while not (letter(ch) or punct(ch) or done) do begin getbite(byt); ch := chr(byt); end; if punct(ch) then begin if ord(ch) = 13 then begin words := chr(255)+chr(255); if not done then begin getbite(byt); ch := chr(byt) end; end else if ch = ' ' then begin words := ch; while (not done) and (ch = ' ') and (length(words) < maxstr) do begin getbite(byt); ch := chr(byt); if ch = ' ' then words := words + ch; end; if (length(words) = maxstr) and (not done) then begin getbite(byt); ch := chr(byt) end; end else begin words := ch; {return one punctuation character} if not done then begin getbite(byt); ch := chr(byt) end; end; end else begin if letter(ch) then begin words := ch; while (not done) and letter(ch) and (length(words) < maxstr) do begin getbite(byt); ch := chr(byt); if letter(ch) then words := words + ch; end; if (length(words) = maxstr) and (not done) then begin getbite(byt); ch := chr(byt) end; end; end; end; {parse} begin clrscr; writeln(' SQUEEZE is designed to remove This program may be copied freely '); writeln(' redundancy from ASCII text files, for educational use or to try it '); writeln(' making them smaller for transmission out. If you like it, donations '); writeln(' and storage. It also improves the will be accepted by the author. '); writeln(' security of files encrypted after This program is believed to be '); writeln(' SQUEEZING. It uses LANGUAGE.COD, reliable, but it is the user',ap,'s '); writeln(' created with COUNT and MAKETREE. responsibility to determine its '); writeln(' fitness for use. No liability '); writeln(' Mike Johnson will be assumed by the author. '); writeln(' P. O. Box 1151 Copyright (C) 1988 Mike Johnson. '); writeln(' Longmont, CO 80502-1151 All rights reserved. '); if (paramcount < 3) then begin writeln; writeln('Syntax: SQUEEZE a|s|u|x outputfile inputfile(s)'); writeln('a and s both mean add or squash, u and x both mean unsquash or extract.'); writeln('Input and output files must be different.'); writeln('Input file name(s) may include wild cards.'); writeln('LANGUAGE.COD (made with MAKETREE) must be in default directory.'); end else begin {Read in language code dictionary.} assign(infile,'LANGUAGE.COD'); reset(infile,1); infilepos := 0; inbitpos := 1; done := eof(infile); newline := nil; last := nil; endfile := nil; if paramcount >= 1 then temp := paramstr(1) else temp := 'X'; temp := upcase(temp[1]); if (temp = 'A') or (temp = 'S') then squish := true else squish := false; writeln('Reading in LANGUAGE.COD.'); for i := 1 to 32 do begin new(top[i]); top[i]^.wrd := ''; top[i]^.count := 0; top[i]^.code := 0; top[i]^.next := nil; bottom[i] := top[i]; end; last := nil; while not done do begin new(newbuf); newbuf^.wrd := ''; getbite(strlen); if not done then begin for i := 1 to strlen do begin getbite(bite); newbuf^.wrd := newbuf^.wrd + chr(bite); end; getbite(bite); newbuf^.count := bite; getbite(bite); newbuf^.code := bite; if newbuf^.count > 8 then begin getbite(bite); long := bite; newbuf^.code := newbuf^.code + (long shl 8); if newbuf^.count > 16 then begin getbite(bite); long := bite; newbuf^.code := newbuf^.code + (long shl 16); if newbuf^.count > 24 then begin getbite(bite); long := bite; newbuf^.code := newbuf^.code + (long shl 24); end; end; end; newbuf^.next := nil; if newbuf^.wrd = chr(255) then begin last := newbuf; end else if newbuf^.wrd = chr(255) + chr(255) then newline := newbuf else if newbuf^.wrd = chr(255) + chr(255) + chr(255) then endfile := newbuf; if squish then begin if ord(newbuf^.wrd[1]) > 107 then i := length(newbuf^.wrd) + 15 else i := length(newbuf^.wrd) end else i := newbuf^.count; bottom[i]^.next := newbuf; bottom[i] := newbuf; write(#13,newbuf^.wrd,' '); end; end; if newline = nil then writeln('Error: newline symbol not in LANGUAGE.COD! '); writeln; close(infile); {Initialize input & output files and global variables.} outname := paramstr(2); filefound := false; assign(outfile,outname); if exist(outname) then begin reset(outfile,1); seek(outfile,filesize(outfile)) end else rewrite(outfile,1); outfilepos := 0; outbitpos := 1; outbite := 0; if squish then start := 3 else start := 2; for cmdparameter := start to paramcount do begin filespec := paramstr(cmdparameter); filemask := filespec; path := ''; repeat i := pos('\',filemask); if i > 0 then begin path := path + copy(filemask, 1, i); filemask := copy(filemask, i+1, length(filemask)); end; until i = 0; findfirst(filespec, readonly + hidden + archive, s); while doserror = 0 do begin filefound := true; inname := path + s.name; assign(infile, inname); reset(infile,1); infilepos := 0; inbitpos := 1; done := eof(infile); {Perform compression on input file(s).} if squish then begin writeln('Compressing ',inname); ch := chr(0); while not done do begin parse(words,ch); {Find word or punctuation} if length(words) > 0 then begin if ord(words[1]) > 107 then buf := top[length(words) + 15] else buf := top[length(words)]; mask := 1; while (buf^.wrd <> words) and (buf^.next <> nil) do buf := buf^.next; if buf^.wrd <> words then begin highvideo; if words = chr(255) + chr(255) then writeln('Newline code not found in tree! ') else write(words); for i := 1 to last^.count do begin masked := last^.code and mask; mask := mask shl 1; if masked = 0 then putbit(0) else putbit(1); end; for i := 0 to length(words) do begin for j := 0 to 6 do begin b := (ord(words[i]) shr j) and 1; putbit(b) end; end; end else begin for i := 1 to buf^.count do begin masked := buf^.code and mask; mask := mask shl 1; if masked = 0 then putbit(0) else putbit(1); end; lowvideo; if words = chr(255) + chr(255) then writeln else write(buf^.wrd); end; end; end; mask := 1; for i := 1 to endfile^.count do begin masked := endfile^.code and mask; mask := mask shl 1; if masked = 0 then putbit(0) else putbit(1); end; end else {Decompress input file.} begin writeln('Decompressing ',inname); cd := 0; mask := 1; cnt := 0; while not done do begin getbit(b); if b > 0 then cd := cd or mask; mask := mask shl 1; if mask <= 0 then writeln('Error: code too long.'); inc(cnt); buf := top[cnt]; while (buf^.next <> nil) and ((cnt <> buf^.count) or (cd <> buf^.code)) do buf := buf^.next; if (cnt = buf^.count) and (cd = buf^.code) then begin if buf = endfile then done := true else if buf^.wrd = chr(255) then begin j := 0; for k := 0 to 6 do begin getbit(b); j := j + (b shl k); end; for i := 1 to j do begin m := 0; for k := 0 to 6 do begin getbit(b); m := m + (b shl k) end; putbite(m); highvideo; write(chr(m)); end; end else if buf^.wrd = chr(255) + chr(255) then begin writeln; putbite(13); putbite(10); end else begin for i := 1 to length(buf^.wrd) do begin b := ord(buf^.wrd[i]); putbite(b) end; lowvideo; write(buf^.wrd) end; cd := 0; cnt := 0; mask := 1; end; end; end; {Finish up or loop back for more work to do.} flushbit; close(infile); findnext(s); end; end; if not filefound then writeln('No matching files found.'); flushbit; close(outfile); end; end. {That's all, folks!}